This paper presents a general asymptotic theory of sequential Bayesian estimation giving results 
for the strongest, almost sure convergence. We show that under certain smoothness conditions 
on the probability model, the greedy information gain maximization algorithm for adaptive 
Bayesian estimation is asymptotically optimal in the sense that the determinant of the posterior 
covariance in a certain neighborhood of the true parameter value is asymptotically minimal. 
Using this result, we also obtain an asymptotic expression for the posterior entropy based on a 
novel definition of almost sure convergence on “most trials” (meaning that the convergence holds 
on a fraction of trials that converges to one). Then, we extend the results to a recently published 
framework, which generalizes the usual adaptive estimation setting by allowing different trial 
placements to be associated with different, random costs of observation. For this setting, the 
author has proposed the heuristic of maximizing the expected information gain divided by the 
expected cost of that placement. In this paper, we show that this myopic strategy satisfies an 
analogous asymptotic optimality result when the convergence of the posterior distribution is 
considered as a function of the total cost (as opposed to the number of observations). 
1. Introduction 

The theoretical framework of this paper is that of Bayesian adaptive estimation with 
an information based objective function (see, e.g., MacKay [9], Kujala and Lukka [7], 
Kujala [6]). Following the notation of Kujala [5, 6], the basic problem we consider is the 
estimation of an unobservable random variable $\Theta : \Omega \to \boldsymbol{\Theta}$ based on a sequence $y_{x_1}, \ldots, y_{x_t}$
of independent (given $\theta$) realizations from some conditional densities $p(y_{x_t} \mid \theta)$ indexed by
trial placements $x_t$, each of which can be adaptively chosen from some set $X$ based on the
outcomes $(y_{x_1}, \ldots, y_{x_{t-1}})$ of the earlier observations. A commonly used greedy strategy
is to choose the next placement so as to maximize the expected immediate information
gain, that is, the decrease of the (differential) entropy of the posterior distribution given 
the next observation. 

Previous work on the asymptotics of Bayesian estimation (see, e.g., Schervish [11], 
van der Vaart [13]) has mostly concentrated on the i.i.d. case, and in the few cases where 
the independent (given $\theta$) but not identical case is considered, it is customarily assumed
that a certain fixed sequence of variables is given. Hence, these results do not apply to
the present situation where the sequence $X_t$ of placements is also random.

Paninski [10] has developed an asymptotic theory for this adaptive setting. He states 
consistency and asymptotic normality results for the greedy information maximization 
placement strategy and quantifies the asymptotic efficiency of the method. However, the 
proofs therein are not complete and hence do not provide a sufficient foundation for 
some generalizations and theorems we are interested in. In this paper, we develop a more 
general theory which allows us to generalize the main results of Paninski [10] to almost 
sure convergence (with novel proofs) and to show that the greedy method is in a certain 
sense asymptotically optimal among all placement methods. Furthermore, we provide a 
rigorous and general framework that lends itself to further extensions of the theory. 

One particular extension we are interested in is analyzing the asymptotic properties of 
the novel framework proposed in Kujala [5]. In this framework, the observation of $Y_x$ is
associated with some random cost $C_x$ (see Section 4.4 for details). To make measurement
“cost-effective”, a myopic placement rule is considered that on each trial $t$ maximizes the
expected value of the information gain (decrease of entropy)

$$G_t = H(\Theta \mid Y_{X_1}, \ldots, Y_{X_{t-1}}) - H(\Theta \mid Y_{X_1}, \ldots, Y_{X_t})$$

divided by the expected value of the cost $C_t = C_{X_t}$. This is called a myopic strategy as
it looks only one step ahead. However, it is not a greedy strategy as it does not optimize 
the immediate gain. 
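
The difference between the myopic gain-per-cost rule and the plain greedy rule can be seen in a minimal sketch. The function name `myopic_placement` and all numbers below are hypothetical and serve only as an illustration; the expected gains and costs are assumed to have been computed elsewhere.

```python
def myopic_placement(expected_gain, expected_cost):
    """Pick the placement maximizing expected gain divided by expected cost."""
    best_x, best_ratio = None, float("-inf")
    for x, gain in expected_gain.items():
        cost = expected_cost[x]
        if cost <= 0:
            raise ValueError("expected cost must be positive")
        ratio = gain / cost
        if ratio > best_ratio:
            best_x, best_ratio = x, ratio
    return best_x, best_ratio


# Hypothetical expected gains (in nats) and expected costs for three placements.
gains = {"a": 0.30, "b": 0.50, "c": 0.10}
costs = {"a": 1.0, "b": 2.5, "c": 0.2}

myopic_x, myopic_ratio = myopic_placement(gains, costs)
greedy_x = max(gains, key=gains.get)  # the greedy rule ignores the costs
```

With these numbers the greedy rule selects the placement with the largest gain, whereas the myopic rule selects a cheap placement with a smaller gain but a better gain-to-cost ratio.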

In Kujala [5], the following fairly simple asymptotic optimality result is given for this 
myopic strategy. 


Theorem 1.1. Suppose that there exists a constant $\alpha > 0$ such that

$$\max_{x \in X} \frac{E(G_t \mid y, X_t = x)}{E(C_t \mid y, X_t = x)} = \alpha \tag{1.1}$$

for all possible sets $y$ of past observations. If the next placement $X_t$ is defined as the
maximizer of (1.1) and if for some $\sigma^2 < \infty$ and $\varepsilon > 0$,

$$\begin{cases}
\mathrm{Var}(G_t \mid Y_{X_1}, \ldots, Y_{X_{t-1}}) \le \sigma^2, \\
\mathrm{Var}(C_t \mid Y_{X_1}, \ldots, Y_{X_{t-1}}) \le \sigma^2, \\
E(C_t \mid Y_{X_1}, \ldots, Y_{X_{t-1}}) \ge \varepsilon
\end{cases}$$

for all $t$, then the gain-to-cost ratio satisfies

$$\lim_{t \to \infty} \frac{G_1 + \cdots + G_t}{C_1 + \cdots + C_t} \overset{\text{a.s.}}{=} \alpha. \tag{1.2}$$

This is asymptotically optimal in the sense that for any other strategy that satisfies (1.2),
we have

$$\limsup_{t \to \infty} \frac{G_1 + \cdots + G_t}{C_1 + \cdots + C_t} \overset{\text{a.s.}}{\le} \alpha.$$


However, this result requires the obtainable information gains to not decrease over 
time for the optimality condition to make sense and hence does not in general apply 
to smooth models. In this paper, we provide a counterpart of the above result using an 
optimality criterion (D-optimality) relevant to smooth models. 

Our results are structured as follows. In Section 2, we derive strong consistency of the 
posterior distributions under extremely mild, purely topological conditions on the family 
of likelihood functions. In Section 3, we consider the local smoothness assumptions (to be 
assumed in a certain neighborhood of the true parameter value) required for asymptotic 
normality. In Section 4.1, we develop a theory of asymptotic proportions and use it for a 
novel type of convergence of random variables that is required in our analysis. Then, in 
Sections 4.2 and 4.3, we are able to quantify the asymptotic covariance and asymptotic 
entropy of the posterior distribution and to show a form of asymptotic optimality for 
the standard greedy information maximization strategy. In Section 4.4, these results 
are generalized to the situation with random costs of observation associated with each 
placement as discussed above. The heuristically justified, myopic placement strategy 
proposed in Kujala [5] turns out to be asymptotically optimal also in the sense of the 
present paper, supporting the view that this strategy is the most natural generalization 
of the greedy information maximization strategy to the situation where the costs of 
observation can vary. We give concrete examples of the optimality results in Section 5 
and then end with general discussion in Section 6. 

1.1. Preliminaries 

We shall denote random variables by upper case letters and their specific values by lower 
case letters. The information theoretic definitions that we will use are the (differential) 
entropy $H(A) = -\int p(a) \log p(a)\,da$, which does depend on the parameterization of $a$, the
Kullback-Leibler divergence

$$D_{KL}(p(a) \,\|\, q(a)) = \int p(a) \log \frac{p(a)}{q(a)}\,da,$$

which is independent of the parameterization, and the mutual information

$$I(A; B) = \int p(b)\, D_{KL}(p(a \mid b) \,\|\, p(a))\,db,$$

which is also independent of the parameterization as well as symmetric. Also, the identities
$I(A; B) = H(A) - E(H(A \mid B)) = H(B) - E(H(B \mid A))$ hold whenever the differences
are well defined. This is all standard notation (see, e.g., Cover and Thomas [3]) except
that in our notation, there is no implicit expectation over the values of $A$ in $H(B \mid A)$, and
so it is a random variable depending on the value of $A$. Similarly, a conditional density
$p(b \mid a)$ as an argument to $D_{KL}(\cdots)$ is treated the same way as any other density of $b$,
with no implicit expectation over $a$.
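
For discrete distributions, these definitions and the two stated properties of the mutual information (the entropy identity and symmetry) can be checked numerically. The following is a minimal sketch with sums in place of integrals; the function names and the joint table are illustrative assumptions, not part of the paper.

```python
import math


def entropy(p):
    """H = -sum p log p (natural logarithm), ignoring zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(A;B) = sum_b p(b) D_KL(p(a|b) || p(a)) for a table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for b, pb in enumerate(p_b):
        if pb > 0:
            cond = [joint[a][b] / pb for a in range(len(joint))]
            total += pb * kl_divergence(cond, p_a)
    return total


def conditional_entropies(joint):
    """Pairs (p(b), H(A | B=b)); no averaging over b is applied here."""
    p_b = [sum(col) for col in zip(*joint)]
    return [
        (pb, entropy([joint[a][b] / pb for a in range(len(joint))]))
        for b, pb in enumerate(p_b)
        if pb > 0
    ]


# Hypothetical joint distribution of (A, B), rows indexed by a, columns by b.
joint = [[0.30, 0.10], [0.05, 0.55]]
p_a = [sum(row) for row in joint]

mi = mutual_information(joint)
mi_from_entropies = entropy(p_a) - sum(pb * h for pb, h in conditional_entropies(joint))
mi_swapped = mutual_information([list(col) for col in zip(*joint)])
```

Here `conditional_entropies` returns one value per outcome of $B$, mirroring the convention that $H(A \mid B)$ is a random variable; the expectation is taken explicitly when forming `mi_from_entropies`.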

The densities $p(a)$ and $p(b)$ above are assumed to be taken w.r.t. arbitrary dominating
measures “$da$” and “$db$”. Thus, following Lindley [8], we are in fact working in full measure
theoretic generality even though we use the more familiar notation. The underlying
probability space is $(\Omega, \mathcal{F}, P)$ and so, for example, $P\{\Theta \in U\}$ means the probability that
the value of $\Theta : \Omega \to \boldsymbol{\Theta}$ is within the measurable set $U \subset \boldsymbol{\Theta}$. In some places we may
abbreviate this by $p(U)$, but it will be clear from the context what random variable is referred
to. When we say “for a.e. $\theta$”, it is w.r.t. the prior distribution of $\Theta$. The $\sigma$-algebra of $\boldsymbol{\Theta}$
is assumed to contain at least the Borel sets of the topology which $\boldsymbol{\Theta}$ is assumed to be
endowed with.

For any fixed $x \in X$, we assume that the conditional densities $p(y_x \mid \theta)$ are given w.r.t.
the same dominating $\sigma$-finite measure “$dy_x$” for all $\theta \in \boldsymbol{\Theta}$ and when we say “for a.e. $y_x$”,
it is w.r.t. this measure. For brevity, we shall indicate conditioning on the data
$\mathbf{Y}_t := (Y_{X_1}, \ldots, Y_{X_t})$ by the subscript $t$ on any quantities that depend on them. For example,
$p_t(\theta) = p(\theta \mid \mathbf{Y}_t)$ is the posterior density of $\Theta$ given $\mathbf{Y}_t$ and $E_t(f(\Theta)) = E(f(\Theta) \mid \mathbf{Y}_t)$ is
the posterior expectation of $f(\Theta)$ given $\mathbf{Y}_t$.

It is often assumed that one can observe multiple independent (given $\theta$) copies of
the same random variable $Y_x$. However, instead of complicating the general notation
with something like $Y_x^{(t)}$, we rely on the fact that the set $X$ can explicitly include
separate indices for any identically distributed copies, for example, one might have
$[Y_{(x,t)} \mid \theta] \overset{\text{i.i.d.}}{\sim} [Y_{(x,t')} \mid \theta]$ for all $t, t' \in \mathbb{N}$, $t \ne t'$. Hence, we can use the simple notation
with no loss of generality.

The greedy information gain maximization strategy can be formally defined as choosing
the placement $X_t$ to be the value $x$ that maximizes the mutual information
$I_{t-1}(\Theta; Y_x) = H_{t-1}(\Theta) - E_{t-1}(H_{t-1}(\Theta \mid Y_x))$, the expected decrease in the entropy of $\Theta$ after the next
observation. In some models, there may be no maximum of the mutual information in
which case the placement should be chosen sufficiently close to the supremum, which we
formally define as the ratio of the mutual information and its supremum converging to
one (condition O4 in Section 4).
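
As an illustration of the greedy rule, the following minimal sketch computes the mutual information for every placement of a toy model and picks the maximizer. The three-point parameter grid, the logistic response function and all names are assumptions made for this example only; the mutual information is evaluated in its symmetric form $H(Y_x) - E(H(Y_x \mid \Theta))$.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a binary outcome with success probability p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)


def expected_information_gain(posterior, likelihood_x):
    """I(Theta; Y_x) = H(Y_x) - E[H(Y_x | Theta)] for a binary outcome Y_x."""
    p_y1 = sum(w * l for w, l in zip(posterior, likelihood_x))
    expected_cond = sum(w * binary_entropy(l) for w, l in zip(posterior, likelihood_x))
    return binary_entropy(p_y1) - expected_cond


def greedy_placement(posterior, likelihood):
    """Return the placement with maximal expected information gain, plus all gains."""
    gains = {x: expected_information_gain(posterior, lik) for x, lik in likelihood.items()}
    return max(gains, key=gains.get), gains


def update(posterior, likelihood_x, y):
    """Bayes update of the discrete posterior after observing y in {0, 1} at x."""
    weights = [w * (l if y == 1 else 1.0 - l) for w, l in zip(posterior, likelihood_x)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical model: P(Y_x = 1 | theta) is a logistic function of x - theta.
thetas = [-1.0, 0.0, 1.0]
placements = [-2.0, -0.5, 0.0, 0.5, 2.0]
likelihood = {
    x: [1.0 / (1.0 + math.exp(-4.0 * (x - th))) for th in thetas] for x in placements
}

prior = [1.0 / 3.0] * 3
x_star, gains = greedy_placement(prior, likelihood)
posterior_after = update(prior, likelihood[x_star], 1)
```

Repeating `greedy_placement` and `update` in a loop gives the adaptive procedure; a placement far from all candidate parameter values (here $x = \pm 2$) yields almost no information because the outcome is nearly certain.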

2. Consistency 

The general assumptions for consistency are: 

C1. The parameter space $\boldsymbol{\Theta}$ is a compact topological space.

C2. The family of log-likelihoods is (essentially) equicontinuous, that is, for all $\theta \in \boldsymbol{\Theta}$
and $\varepsilon > 0$, there exists a neighborhood $U$ of $\theta$ such that whenever $\theta' \in U$,

$$|\log p(y_x \mid \theta) - \log p(y_x \mid \theta')| < \varepsilon$$

for a.e. $y_x$ for all $x \in X$.

Asymptotic optimality of myopic strategies

C3. All points in $\boldsymbol{\Theta}$ are statistically distinguishable from each other. That is, for all
distinct $\theta, \theta' \in \boldsymbol{\Theta}$,

$$d_x(\theta, \theta') := \int |p(y_x \mid \theta) - p(y_x \mid \theta')|\,dy_x > 0$$

for some $x \in X$.

C4. For some $\gamma > 0$, the placements $X_t$ satisfy

$$I_{t-1}(\Theta; Y_{X_t}) \ge \gamma \sup_{x \in X} I_{t-1}(\Theta; Y_x)$$

for all sufficiently large $t$.

Remark 2.1. These assumptions for consistency are considerably weaker than those 
formulated in Paninski [10]. In particular, the assumptions C1-C3 only pertain to the
likelihood function $p(y_x \mid \theta)$; nothing at all is assumed about the prior distribution
of $\Theta$. Furthermore, these assumptions are purely topological in the sense that they are
preserved by all homeomorphic transformations of $\boldsymbol{\Theta}$. Also, in C4, we do not require
perfect maximization of information gain; this is useful as it allows us to apply the same 
result to the non-greedy strategy discussed in Section 4.4 as well. 

Remark 2.2. Non-compact spaces can be handled if the log-likelihood has an (essentially)
equicontinuous extension to a compactification of $\boldsymbol{\Theta}$. This happens precisely when
the following conditions hold:

C1'. The parameter space $\boldsymbol{\Theta}$ is a topological space.

C2'. The function $f(\theta) = ((x, y_x) \mapsto \log p(y_x \mid \theta))$, with the topology of the target space
induced by the ($[0, \infty]$-valued) norm

$$\|u\| = \sup_{x \in X} \operatorname*{ess\,sup}_{y_x} |u(x, y_x)|,$$

is continuous (this is just restating C2) and the closure of the range $f(\boldsymbol{\Theta})$ is
compact (this is the extra condition needed for non-compact spaces).

C3'. For all distinct $\theta, \theta' \in \boldsymbol{\Theta}$, the inequality $f(\theta) \ne f(\theta')$ holds true, where equality is
interpreted w.r.t. a.e. $y_x$. (This is equivalent to C3.)

In that case, $f$ lifts continuously to the Stone-Čech compactification $\beta\boldsymbol{\Theta}$ of $\boldsymbol{\Theta}$ (Theorem A.1).
Condition C3 may not hold for the points added by the compactification, but
this can be fixed by moving to the compact quotient space $\beta\boldsymbol{\Theta}/\ker(f)$. Thus, C1-C3 can
always be replaced by the strictly weaker conditions C1'-C3'.

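Lemma 2.1. Suppose that C1-C3 hold. Then, there exists a metric $d : \boldsymbol{\Theta} \times \boldsymbol{\Theta} \to \mathbb{R}$ that
is consistent with the topology of $\boldsymbol{\Theta}$, and an estimator $\hat\Theta_t$ such that for each $t$ there exists
$x \in X$ such that

$$I_t(Y_x; \Theta) \ge E_t(d(\Theta, \hat\Theta_t)^2).$$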
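Proof. First, we show that the pseudometric $d_x$ defined in C3 is continuous in $\boldsymbol{\Theta} \times \boldsymbol{\Theta}$
for all $x \in X$. It can be shown using C2 that for any $\theta \in \boldsymbol{\Theta}$ and $\varepsilon > 0$, there exists
a neighborhood $U_{\theta,\varepsilon}$ such that $d_x(\theta, \theta') < \varepsilon$ for all $\theta' \in U_{\theta,\varepsilon}$. Thus, for any $\varepsilon > 0$ and
$\theta_1, \theta_2 \in \boldsymbol{\Theta}$, the triangle inequality implies

$$|d_x(\theta'_1, \theta'_2) - d_x(\theta_1, \theta_2)| \le d_x(\theta_1, \theta'_1) + d_x(\theta_2, \theta'_2) < 2\varepsilon$$

whenever $(\theta'_1, \theta'_2) \in U_{\theta_1,\varepsilon} \times U_{\theta_2,\varepsilon}$, and so $d_x$ is continuous.

As $d_x$ is continuous, the set

$$O_x = \{(\theta, \theta') \in \boldsymbol{\Theta} \times \boldsymbol{\Theta} : d_x(\theta, \theta') > 0\}$$

is open for every $x \in X$. Now C3 implies that $\bigcup_{x \in X} O_x$ covers $\boldsymbol{\Theta} \times \boldsymbol{\Theta}$, and as $\boldsymbol{\Theta} \times \boldsymbol{\Theta}$ is
compact, there exists a finite subcover $\bigcup_{x \in X'} O_x$. It follows that

$$d(\theta, \theta') = \left[ \frac{1}{8|X'|} \sum_{x \in X'} \left( \int |p(y_x \mid \theta) - p(y_x \mid \theta')|\,dy_x \right)^2 \right]^{1/2}$$

is positive definite and hence a metric. Since $X'$ is finite, this metric inherits the continuity
of $d_x$.

To show that the topology induced by $d$ coincides with that of $\boldsymbol{\Theta}$, let $U$ be an arbitrary
open neighborhood of $\theta_0$. Then $U^c$ is compact and so its continuous image
$S := \{d(\theta_0, \theta) : \theta \in U^c\}$ is compact, too. It follows that $S^c$ is open and as $0 \in S^c$, we
obtain $[0, \delta_U) \subset S^c$ for some $\delta_U > 0$. Thus, we obtain $\{\theta \in \boldsymbol{\Theta} : d(\theta_0, \theta) < \delta_U\} \subset U$, and so
the topology induced by $d$ is finer than the default topology of $\boldsymbol{\Theta}$. As $d$ is continuous, we
obtain the converse, and so the topologies coincide.

Let then $t$ be arbitrary. We extend $d(\theta, \theta')$ with a special point $e_t \notin \boldsymbol{\Theta}$ for which we
define the distances

$$d(\theta, e_t) = \left[ \frac{1}{8|X'|} \sum_{x \in X'} \left( \int |p(y_x \mid \theta) - p_t(y_x)|\,dy_x \right)^2 \right]^{1/2}.$$

The extended distance function may not be strictly positive definite, but it is still a
pseudometric and satisfies the triangle inequality. Denoting

$$\hat\Theta_t = \operatorname*{arg\,min}_{\theta \in \boldsymbol{\Theta}} d(\theta, e_t),$$

we have $d(\theta, e_t) \ge d(\hat\Theta_t, e_t)$ for all $\theta \in \boldsymbol{\Theta}$, and the triangle inequality yields
$d(\theta, e_t) \ge d(\theta, \hat\Theta_t) - d(\hat\Theta_t, e_t)$. Adding both inequalities, we obtain $2d(\theta, e_t) \ge d(\theta, \hat\Theta_t)$ for all $\theta \in \boldsymbol{\Theta}$.
Now, the $L^1$-bound of Kullback-Leibler divergence [3], Lemma 11.6.1, yields

$$\begin{aligned}
\max_{x} I_t(Y_x; \Theta) &\ge \frac{1}{|X'|} \sum_{x \in X'} I_t(Y_x; \Theta) \\
&= \frac{1}{|X'|} \sum_{x \in X'} \int D_{KL}(p(y_x \mid \theta) \,\|\, p_t(y_x))\, p_t(\theta)\,d\theta \\
&\ge \int \frac{1}{2|X'|} \sum_{x \in X'} \left( \int |p(y_x \mid \theta) - p_t(y_x)|\,dy_x \right)^2 p_t(\theta)\,d\theta \qquad (L^1 \text{ bound}) \\
&= 4 \int d(\theta, e_t)^2\, p_t(\theta)\,d\theta \ge \int d(\theta, \hat\Theta_t)^2\, p_t(\theta)\,d\theta.
\end{aligned}$$

□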
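
The $L^1$ bound used in the last chain of inequalities states that $D_{KL}(p \,\|\, q) \ge \frac{1}{2} \left( \sum |p - q| \right)^2$ when the divergence is measured in nats. A minimal numerical sketch on randomly generated discrete distributions is given below; the sampling scheme, the seed and the helper names are arbitrary choices for this illustration.

```python
import math
import random


def kl_divergence(p, q):
    """D_KL(p || q) in nats for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def l1_distance(p, q):
    """Sum of absolute differences between two discrete distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_distribution(rng, size):
    """A random strictly positive probability vector of the given size."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
violations = 0
for _ in range(1000):
    p = random_distribution(rng, 5)
    q = random_distribution(rng, 5)
    # The bound should hold for every pair; allow a tiny rounding tolerance.
    if kl_divergence(p, q) < 0.5 * l1_distance(p, q) ** 2 - 1e-12:
        violations += 1
```

The factor $\frac{1}{8|X'|}$ in the definition of $d$ is chosen so that this bound, combined with $2d(\theta, e_t) \ge d(\theta, \hat\Theta_t)$, yields the constant 1 in the statement of the lemma.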


Lemma 2.2. Suppose that $K$ is a function of $\Theta$ and has a finite range $\mathcal{K}$. Then, for
arbitrarily chosen placements $X_t$, the inequality $\sum_{t=1}^{\infty} I_{t-1}(K; Y_{X_t}) < \infty$ holds almost
surely (which implies $I_{t-1}(K; Y_{X_t}) \to 0$).

Proof. As $I_{t-1}(K; Y_{X_t}) = H_{t-1}(K) - E_{t-1}(H_t(K))$, where $0 \le H_t(K) \le \log|\mathcal{K}|$ for all $t$,
we obtain

$$E\left( \sum_{k=1}^{t} I_{k-1}(K; Y_{X_k}) \right) = E(H_0(K) - E_{t-1}(H_t(K))) \le \log|\mathcal{K}|$$

for all $t$. As $I_{k-1}(K; Y_{X_k})$ is nonnegative, the sequence of partial sums is non-decreasing,
and Lebesgue’s monotone convergence theorem yields

$$E\left( \sum_{k=1}^{\infty} I_{k-1}(K; Y_{X_k}) \right) = \lim_{t \to \infty} E\left( \sum_{k=1}^{t} I_{k-1}(K; Y_{X_k}) \right) \le \log|\mathcal{K}| < \infty,$$

which implies the statement. □

Lemma 2.3. Suppose that C1 and C2 hold. Then $I_{t-1}(\Theta; Y_{X_t}) \overset{\text{a.s.}}{\longrightarrow} 0$ for arbitrarily chosen
placements $X_t$.



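Proof. Let $\varepsilon > 0$ be arbitrary. As $\boldsymbol{\Theta}$ is compact, a finite number of the sets $U_{\theta,\varepsilon}$ given by
C2 cover it. Thus, we can partition the parameter space into a finite number of subsets
$\boldsymbol{\Theta}_k$, each one contained in some $U_{\theta,\varepsilon}$. Letting the random variable $K$ denote the index of
the subset that $\Theta$ falls into, the chain rule of mutual information yields

$$I_{t-1}(\Theta; Y_t) = I_{t-1}(\Theta, K; Y_t) = I_{t-1}(K; Y_t) + \sum_k p_{t-1}(k)\, I_{t-1}(\Theta; Y_t \mid k), \tag{2.1}$$

where $Y_t := Y_{X_t}$ and Lemma 2.2 implies that $I_{t-1}(K; Y_t) \overset{\text{a.s.}}{\longrightarrow} 0$. Let us then look at the
latter term. Convexity of the Kullback-Leibler divergence yields

$$\begin{aligned}
I_{t-1}(\Theta; Y_t \mid k) &= \int p_{t-1}(\theta \mid k)\, D_{KL}(p(y_t \mid \theta) \,\|\, p_{t-1}(y_t \mid k))\,d\theta \\
&\le \int p_{t-1}(\theta \mid k) \left[ \int p_{t-1}(\theta' \mid k)\, D_{KL}(p(y_t \mid \theta) \,\|\, p(y_t \mid \theta'))\,d\theta' \right] d\theta \\
&\le 2\varepsilon
\end{aligned}$$

almost surely, where the last step holds because $|\log p(y_t \mid \theta) - \log p(y_t \mid \theta')| < 2\varepsilon$ for a.e. $y_t$
whenever $\theta$ and $\theta'$ belong to the same subset $\boldsymbol{\Theta}_k$. As $\varepsilon > 0$ was arbitrary, we obtain
$I_{t-1}(\Theta; Y_t) \overset{\text{a.s.}}{\longrightarrow} 0$. □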


Lemma 2.4. For any measurable function $f : \boldsymbol{\Theta} \to \mathbb{R}$, if the prior expectation $E f(\Theta)$ is
well-defined and finite, then $\lim_{t \to \infty} E_t f(\Theta)$ exists as a finite number almost surely.

Proof. The finiteness of $E f(\Theta)$ implies that $E|f(\Theta)|$ must also be finite and so
$Z_t := E_t f(\Theta)$ satisfies $E|Z_t| = E|E_t f(\Theta)| \le E|f(\Theta)| < \infty$ for all $t$. Furthermore, since $Z_{t+1}$
depends linearly on the posterior $p_{t+1}$ whose expectation $E_t(p_{t+1})$ equals the prior $p_t$,
we obtain $E_t(Z_{t+1}) = Z_t$ for all $t$ and so $Z_t$ is a martingale. As $\sup_t E|Z_t| \le E|f(\Theta)| < \infty$,
Theorem A.2 implies that $\lim Z_t$ exists as a finite number almost surely. □
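
The martingale property $E_t(Z_{t+1}) = Z_t$ used in this proof can be verified exactly in a small discrete model. The sketch below assumes a three-point parameter set with a Bernoulli observation; all values and helper names are hypothetical.

```python
def predictive(weights, lik_y):
    """Predictive probability of an outcome under the current posterior weights."""
    return sum(w * l for w, l in zip(weights, lik_y))


def posterior(weights, lik_y):
    """Posterior weights after observing the outcome with likelihood vector lik_y."""
    z = predictive(weights, lik_y)
    return [w * l / z for w, l in zip(weights, lik_y)]


def expectation(weights, values):
    """Expectation of a function of the parameter under the given weights."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical setup: Theta takes three values, Y is Bernoulli(theta).
thetas = [0.2, 0.5, 0.8]
p_t = [0.5, 0.3, 0.2]          # current posterior p_t
f_values = [1.0, -2.0, 4.0]    # an arbitrary function f of Theta
likelihood = {1: thetas, 0: [1.0 - th for th in thetas]}

z_t = expectation(p_t, f_values)
# E_t(Z_{t+1}): average the updated expectation over the predictive distribution.
expected_z_next = sum(
    predictive(p_t, likelihood[y]) * expectation(posterior(p_t, likelihood[y]), f_values)
    for y in (0, 1)
)
```

The two quantities `z_t` and `expected_z_next` agree up to rounding, which is the discrete form of the statement that the expected posterior equals the current posterior.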

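Theorem 2.1 (Strong consistency). Suppose that C1-C4 hold. Then, conditioned on
almost any $\theta_0 \in \boldsymbol{\Theta}$ as the true parameter value, the posteriors are strongly consistent, that
is, $P_t\{\Theta \in U\} \overset{\text{a.s.}}{\longrightarrow} 1$ for any neighborhood $U$ of $\theta_0$.

Proof. As the metric $d$ given by Lemma 2.1 is bounded, Lemma 2.4 implies that
$\lim_{t \to \infty} E_t(d(\Theta, \theta))$ exists and is finite for all $\theta$ in a countable dense subset of $\boldsymbol{\Theta}$ almost
surely, in which case continuity of $d$ implies the same for all $\theta \in \boldsymbol{\Theta}$.

Lemmas 2.1 and 2.3 and C4 yield $E_t(d(\Theta, \hat\Theta_t)) \overset{\text{a.s.}}{\longrightarrow} 0$. As $d$ is bounded, Lebesgue’s
dominated convergence theorem and Markov’s inequality imply

$$P\{d(\Theta, \hat\Theta_t) > \varepsilon\} \le \frac{E(d(\Theta, \hat\Theta_t))}{\varepsilon} = \frac{E(E_t(d(\Theta, \hat\Theta_t)))}{\varepsilon} \to 0$$

for all $\varepsilon > 0$ and so $d(\Theta, \hat\Theta_t) \overset{P}{\longrightarrow} 0$. Convergence in probability implies that there exists a
subsequence $t_k$ such that $d(\Theta, \hat\Theta_{t_k}) \overset{\text{a.s.}}{\longrightarrow} 0$. Thus, conditioned on almost any $\theta_0$ as the true
value, we obtain $d(\theta_0, \hat\Theta_{t_k}) \overset{\text{a.s.}}{\longrightarrow} 0$, and the triangle inequality yields

$$E_{t_k}(d(\Theta, \theta_0)) \le E_{t_k}(d(\Theta, \hat\Theta_{t_k})) + d(\theta_0, \hat\Theta_{t_k}) \overset{\text{a.s.}}{\longrightarrow} 0.$$

As we have already established that the full sequence $E_t(d(\Theta, \theta_0))$ almost surely converges,
it now follows that the limit must almost surely be zero. Thus, given any neighborhood
$U \supset B_d(\theta_0, \varepsilon)$ of $\theta_0$, Markov’s inequality yields

$$P_t\{\Theta \in U^c\} \le P_t\{\Theta \in B_d(\theta_0, \varepsilon)^c\} \le \frac{E_t(d(\Theta, \theta_0))}{\varepsilon} \overset{\text{a.s.}}{\longrightarrow} 0.$$

□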
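Lemma 2.5. Suppose that C1-C3 hold and assume that conditioned on $\theta_0 \in \boldsymbol{\Theta}$ as the
true parameter value, the posteriors are strongly consistent. Then:

1. Given any metric $d$ consistent with the topology of $\boldsymbol{\Theta}$,

$$\Theta^*_t := \operatorname*{arg\,min}_{\theta} E_t(d(\Theta, \theta)^2) \overset{\text{a.s.}}{\longrightarrow} \theta_0.$$

2. For any neighborhood $U$ of $\theta_0$ there exists a constant $c > 0$ such that, almost surely,
$I_t(Y_x; \Theta) \ge c\,P_t\{\Theta \in U^c\}$ for some $x \in X$ for all sufficiently large $t$.

Proof. Let $D$ be the diameter of $\boldsymbol{\Theta}$. The triangle inequality $a \le b + c$ implies
$a^2 \le (b + c)^2 \le 2(b^2 + c^2)$ and so consistency of the posteriors yields

$$\begin{aligned}
d(\theta_0, \Theta^*_t)^2 &\le 2E_t(d(\Theta, \theta_0)^2 + d(\Theta, \Theta^*_t)^2) \le 4E_t(d(\Theta, \theta_0)^2) \\
&\le 4(r^2 + D^2 P_t\{\Theta \in B_d(\theta_0, r)^c\}) \overset{\text{a.s.}}{\longrightarrow} 4(r^2 + D^2 \cdot 0)
\end{aligned}$$

for all $r > 0$, which implies $\Theta^*_t \overset{\text{a.s.}}{\longrightarrow} \theta_0$.

Let us then assume that the metric $d$ is the one given by Lemma 2.1 and choose $\varepsilon > 0$
such that $B_d(\theta_0, 2\varepsilon) \subset U$. As $\Theta^*_t \overset{\text{a.s.}}{\longrightarrow} \theta_0$, we have $B_d(\Theta^*_t, \varepsilon) \subset U$ for all sufficiently large $t$,
and so Lemma 2.1 and Markov’s inequality yield

$$\begin{aligned}
I_t(Y_x; \Theta) &\ge E_t(d(\Theta, \hat\Theta_t)^2) \\
&\ge E_t(d(\Theta, \Theta^*_t)^2) \ge \varepsilon^2 P_t\{\Theta \in B_d(\Theta^*_t, \varepsilon)^c\} \ge \varepsilon^2 P_t\{\Theta \in U^c\}
\end{aligned}$$

for some $x \in X$. □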

2.1. Asymptotic entropy 

The differential entropy is sensitive to the parameterization, but asymptotically, we can 
in most cases ignore this due to the following lemma. 

Lemma 2.6. Suppose that the prior entropy $H(\Theta)$ is well-defined and finite. Then,

$$\lim_{t \to \infty} [H_t(\Theta) + D_{KL}(p_t(\theta) \,\|\, p(\theta))]$$

exists as a finite number almost surely.

Proof. As $H_t(\Theta) + D_{KL}(p_t(\theta) \,\|\, p(\theta)) = -E_t \log p(\Theta)$ and $E \log p(\Theta) = -H(\Theta)$ is
well-defined and finite, the statement follows from Lemma 2.4. □

Lemma 2.7. Suppose that C1' holds and let $f$ be defined as in C2'. Then, for any subset
$S \subset \boldsymbol{\Theta}$,

$$|\log p_{t+1}(\theta \mid S) - \log p_t(\theta \mid S)| \le 2 \operatorname{diam} f(S)$$

for all $\theta \in S$. If C2' holds, then this upper bound is finite.

Proof. Let $\theta_1 \in S$ be fixed. If $p_t(\theta \mid S)$ is multiplied by $p(y_x \mid \theta)/p(y_x \mid \theta_1)$, it can change
by at most a factor of $\exp(\operatorname{diam} f(S))$, and for the same reason, the normalization constant
for this density is within a factor of $\exp(\operatorname{diam} f(S))$ from 1. The statement follows.

Suppose then that C2' holds. As the closure $\overline{f(\boldsymbol{\Theta})}$ is compact, it follows that $f(S) \subset \overline{f(\boldsymbol{\Theta})}$ must
be bounded. □

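Lemma 2.8. Suppose that C1 and C2 hold. Then, for any $\varepsilon > 0$, the inequality
$D_{KL}(p_t(\theta) \,\|\, p(\theta)) \le \varepsilon t$ holds true for all sufficiently large $t$.

Proof. Let $\varepsilon > 0$ be arbitrary. As in the proof of Lemma 2.3, we partition $\boldsymbol{\Theta}$ into a finite
number of subsets $\boldsymbol{\Theta}_k$ such that $|\log p(y_x \mid \theta) - \log p(y_x \mid \theta_k)| < \varepsilon$ for all $\theta \in \boldsymbol{\Theta}_k$, $y_x$, and
$x \in X$, where $\theta_k$ is some fixed point of $\boldsymbol{\Theta}_k$. Let the random variable $K$ denote the index
of the subset that $\Theta$ falls into. Lemma 2.7 implies that

$$|\log p_{t+1}(\theta \mid k) - \log p_t(\theta \mid k)| \le 2\varepsilon$$

for all $\theta \in \boldsymbol{\Theta}_k$, which yields

$$D_{KL}(p_t(\theta \mid k) \,\|\, p(\theta \mid k)) = E_t\left( \log \frac{p_t(\Theta \mid k)}{p(\Theta \mid k)} \,\middle|\, k \right) \le 2\varepsilon t$$

for all $t$ and $k$. The chain rule of Kullback-Leibler divergence now yields

$$\begin{aligned}
D_{KL}(p_t(\theta) \,\|\, p(\theta)) &= D_{KL}(p_t(k) \,\|\, p(k)) + \sum_k p_t(k)\, D_{KL}(p_t(\theta \mid k) \,\|\, p(\theta \mid k)) \\
&\le \log \max_k p(k)^{-1} + 2\varepsilon t,
\end{aligned}$$

where we may assume that $p(k)$ is positive since we can drop any set $\boldsymbol{\Theta}_k$ with $p(k) = 0$
from the partition. □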


Lemma 2.9. Suppose that $\boldsymbol{\Theta} \subset \mathbb{R}^n$ is bounded and the family of log-likelihoods is uniformly
Lipschitz, that is,

$$|\log p(y_x \mid \theta) - \log p(y_x \mid \theta')| \le M|\theta - \theta'|$$

for all $\theta, \theta' \in \boldsymbol{\Theta}$ for all $y_x$ and $x \in X$. Then, for arbitrarily chosen placements $X_t$, the
expected gain over $t$ trials is bounded by $I(\Theta; \mathbf{Y}_t) \le n \log t + c$ for some constant $c < \infty$.

Proof. For each $t$, we can subdivide the bounded parameter space $\boldsymbol{\Theta}$ into at most $ct^n$ subsets
$\boldsymbol{\Theta}_k$, each having diameter at most $t^{-1}$. Letting the random variable $K_t$ denote the index of the
subset that $\Theta$ falls into, the chain rule of mutual information yields

$$I(\Theta; \mathbf{Y}_t) = \underbrace{I(K_t; \mathbf{Y}_t)}_{\le \log(ct^n)} + \sum_{k_t} p(k_t) \underbrace{I(\Theta; \mathbf{Y}_t \mid k_t)}_{\le M} \le n \log t + \log c + M \tag{2.2}$$

as in equation (2.1) in the proof of Lemma 2.3. □


3. Asymptotic normality 

In this section, we assume that: 

N1. The parameter space $\boldsymbol{\Theta}$ is a subset of $\mathbb{R}^n$.

N2. The true parameter value $\theta_0$ is an interior point of $\boldsymbol{\Theta}$.

N3. The log-likelihood $\theta \mapsto \log p(y_x \mid \theta)$ is twice continuously differentiable with
$|\nabla_\theta \log p(y_x \mid \theta)| \le M$ and $|\nabla^2_\theta \log p(y_x \mid \theta)| \le M$ for all $x \in X$ and $y_x$.

N4. The family of Hessians $\theta \mapsto \nabla^2_\theta \log p(y_x \mid \theta)$ is equicontinuous at $\theta_0$ over all $x \in X$
and $y_x$.

N5. The prior distribution is absolutely continuous w.r.t. the Lebesgue measure with positive
and continuous density at $\theta_0$.

For simplicity of notation, all statements are implicitly conditioned on $\theta_0$ being the true
parameter value. Throughout this section, we will denote the posterior mean and covariance
by $\hat\Theta_t := E_t(\Theta)$ and $\Sigma_t = \mathrm{Cov}_t(\Theta)$. Note that the expected square error $E_t(|\Theta - \theta|^2)$
is minimized by the mean $\theta = E_t(\Theta)$. Thus, if the posteriors are strongly consistent,
then Lemma 2.5 implies that $\hat\Theta_t \overset{\text{a.s.}}{\longrightarrow} \theta_0$. Note also that the square error is related to the
variance through the identity $E_t(|\Theta - \hat\Theta_t|^2) = \mathrm{tr}(\Sigma_t)$.

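Lemma 3.1. Suppose that N1 and N3 hold and $\boldsymbol{\Theta}$ is a bounded convex set with
diameter $\le D < \infty$. Then, there exists a constant $C_{M,D} < \infty$ such that for all $t$ and
$x$,

$$|I_t(Y_x; \Theta) - (\tfrac{1}{2}\Sigma_t) \odot I_x(\hat\Theta_t)| \le C_{M,D}\, E_t(|\Theta - \hat\Theta_t|^3),$$

where $\odot$ denotes the Frobenius product $A \odot B = \sum_{i,j} A_{ij} B_{ij} = \mathrm{tr}(A^T B)$, and $I_x(\theta)$ is the
Fisher information matrix

$$I_x(\theta) := \int \left[ \frac{\nabla_\theta p(y_x \mid \theta)}{p(y_x \mid \theta)} \right] \left[ \frac{\nabla_\theta p(y_x \mid \theta)}{p(y_x \mid \theta)} \right]^T p(y_x \mid \theta)\,dy_x.$$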

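Proof. We can formally expand the mutual information as

$$\begin{aligned}
I_t(Y_x; \Theta) &= H_t(Y_x) - E_t(H(Y_x \mid \Theta)) \\
&= \int g\left( \int p(y_x \mid \theta) p_t(\theta)\,d\theta \right) dy_x - \int \left( \int g(p(y_x \mid \theta))\,dy_x \right) p_t(\theta)\,d\theta \\
&= \int \left[ g\left( \int p(y_x \mid \theta) p_t(\theta)\,d\theta \right) - \int g(p(y_x \mid \theta)) p_t(\theta)\,d\theta \right] dy_x,
\end{aligned}$$

where $g(p) = -p \log p$. (Although $H_t(Y_x) - E_t(H(Y_x \mid \Theta))$ may not be well defined here,
the last line is always well-defined and equal to the mutual information.) Denoting
$p_{y_x} := p(y_x \mid \hat\Theta_t)$, Taylor’s theorem yields

$$g(p) = -p_{y_x} \log p_{y_x} - (1 + \log p_{y_x})(p - p_{y_x}) - \frac{(p - p_{y_x})^2}{2p_{y_x}} + \frac{(p - p_{y_x})^3}{6\xi_{p,y_x}^2},$$

where $\xi_{p,y_x}$ is some number between $p_{y_x}$ and $p$. The error term is bounded by

$$|\varepsilon_{y_x}(p)| := \left| \frac{(p - p_{y_x})^3}{6\xi_{p,y_x}^2} \right| \le \frac{|p - p_{y_x}|^3}{6\min\{p, p_{y_x}\}^2} \le \tfrac{1}{6}(\exp(|\log p - \log p_{y_x}|) - 1)^3 p_{y_x},$$

and as $|\log p(y_x \mid \theta) - \log p(y_x \mid \hat\Theta_t)| \le M|\theta - \hat\Theta_t| \le MD$, we further obtain

$$\begin{aligned}
|\varepsilon_{y_x}(p(y_x \mid \theta))| &\le \tfrac{1}{6}(\exp(|\log p(y_x \mid \theta) - \log p(y_x \mid \hat\Theta_t)|) - 1)^3 p(y_x \mid \hat\Theta_t) \\
&\le \tfrac{1}{6}(\exp(M|\theta - \hat\Theta_t|) - 1)^3 p(y_x \mid \hat\Theta_t) \\
&\le C_1 |\theta - \hat\Theta_t|^3 p(y_x \mid \hat\Theta_t)
\end{aligned}$$

for a constant $C_1$ depending only on $M$ and $D$. Due to the linearity of the integral, the constant
and first order terms of the expansion cancel out, leaving just

$$\begin{aligned}
I_t(Y_x; \Theta) &\approx \int \left[ -\frac{\left( \int p(y_x \mid \theta) p_t(\theta)\,d\theta - p_{y_x} \right)^2}{2p_{y_x}} + \frac{\int (p(y_x \mid \theta) - p_{y_x})^2 p_t(\theta)\,d\theta}{2p_{y_x}} \right] dy_x \\
&= \frac{1}{2} \int \mathrm{Var}_t\left( \frac{p(y_x \mid \Theta)}{p(y_x \mid \hat\Theta_t)} \right) p(y_x \mid \hat\Theta_t)\,dy_x,
\end{aligned}$$

where the error is bounded by

$$\begin{aligned}
&\left| \int \varepsilon_{y_x}\left( \int p(y_x \mid \theta) p_t(\theta)\,d\theta \right) dy_x - \int\!\!\int \varepsilon_{y_x}(p(y_x \mid \theta)) p_t(\theta)\,d\theta\,dy_x \right| \\
&\quad \le \int \left[ \left| \varepsilon_{y_x}\left( \int p(y_x \mid \theta) p_t(\theta)\,d\theta \right) \right| + \int |\varepsilon_{y_x}(p(y_x \mid \theta))| p_t(\theta)\,d\theta \right] dy_x \\
&\quad \le \int \left[ \int |\varepsilon_{y_x}(p(y_x \mid \theta))| p_t(\theta)\,d\theta + \int |\varepsilon_{y_x}(p(y_x \mid \theta))| p_t(\theta)\,d\theta \right] dy_x \qquad \text{(Jensen)} \\
&\quad \le \int 2 \int C_1 |\theta - \hat\Theta_t|^3 p(y_x \mid \hat\Theta_t) p_t(\theta)\,d\theta\,dy_x \le 2C_1 E_t(|\Theta - \hat\Theta_t|^3)
\end{aligned}$$

for all $t$, $\hat\Theta_t$, and $x$ (Jensen’s inequality applies as $|\varepsilon_{y_x}(p)|$ is convex).

Now Taylor’s theorem yields

$$\frac{p(y_x \mid \theta)}{p(y_x \mid \hat\Theta_t)} = 1 + \frac{\nabla_\theta p(y_x \mid \hat\Theta_t)^T}{p(y_x \mid \hat\Theta_t)} (\theta - \hat\Theta_t) + \frac{1}{2} (\theta - \hat\Theta_t)^T \frac{\nabla^2_\theta p(y_x \mid \theta')}{p(y_x \mid \hat\Theta_t)} (\theta - \hat\Theta_t),$$

where $\theta'$ is a convex combination of $\hat\Theta_t$ and $\theta$. The coefficients are uniformly bounded by

$$\left| \frac{\nabla_\theta p(y_x \mid \hat\Theta_t)}{p(y_x \mid \hat\Theta_t)} \right| = |\nabla_\theta \log p(y_x \mid \hat\Theta_t)| \le M$$

and

$$\left| \frac{\nabla^2_\theta p(y_x \mid \theta')}{p(y_x \mid \hat\Theta_t)} \right| = \Big| \underbrace{\nabla_\theta \log p(y_x \mid \theta')}_{|\cdot| \le M} \underbrace{\nabla_\theta \log p(y_x \mid \theta')^T}_{|\cdot| \le M} + \underbrace{\nabla^2_\theta \log p(y_x \mid \theta')}_{|\cdot| \le M} \Big| \underbrace{\frac{p(y_x \mid \theta')}{p(y_x \mid \hat\Theta_t)}}_{\le \exp(MD)} \le \exp(MD)(M^2 + M) =: C_2.$$

Thus, denoting the linear term by $A$ and the error term by $B$, we obtain

$$\mathrm{Var}_t\left( \frac{p(y_x \mid \Theta)}{p(y_x \mid \hat\Theta_t)} \right) = \mathrm{Var}_t(A) + \mathrm{Var}_t(B) + 2\,\mathrm{Cov}_t(A, B),$$

where

$$\begin{aligned}
\mathrm{Var}_t(A) &= \Sigma_t \odot \left[ \frac{\nabla_\theta p(y_x \mid \hat\Theta_t)}{p(y_x \mid \hat\Theta_t)} \right] \left[ \frac{\nabla_\theta p(y_x \mid \hat\Theta_t)}{p(y_x \mid \hat\Theta_t)} \right]^T, \\
\mathrm{Var}_t(B) &\le E_t(|B|^2) \le (\tfrac{1}{2}C_2)^2 E_t(|\Theta - \hat\Theta_t|^4) \le (\tfrac{1}{2}C_2)^2 D\, E_t(|\Theta - \hat\Theta_t|^3), \\
|\mathrm{Cov}_t(A, B)| &= |E_t(AB) - \underbrace{E_t(A)}_{=0} E_t(B)| \le E_t(|A||B|) \le M \tfrac{1}{2} C_2\, E_t(|\Theta - \hat\Theta_t|^3).
\end{aligned}$$

Integrating these three terms against $\tfrac{1}{2} p(y_x \mid \hat\Theta_t)\,dy_x$ and combining the result with the error
bound above gives the statement. □
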
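
The approximation in Lemma 3.1 can be illustrated in one dimension. The sketch below assumes a Bernoulli observation with success probability $\theta$ (so that the Fisher information is $1/(\theta(1-\theta))$) and a narrow three-point posterior; these modelling choices and all names are illustrative assumptions.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a binary outcome with success probability p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)


def mutual_information(thetas, weights):
    """Exact I(Theta; Y) = H(Y) - E[H(Y | Theta)] for Y ~ Bernoulli(Theta)."""
    p_y1 = sum(w * th for w, th in zip(weights, thetas))
    expected_cond = sum(w * binary_entropy(th) for w, th in zip(weights, thetas))
    return binary_entropy(p_y1) - expected_cond


def fisher_information(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


# Hypothetical narrow posterior concentrated around 0.30.
thetas = [0.29, 0.30, 0.31]
weights = [0.25, 0.50, 0.25]

mean = sum(w * th for w, th in zip(weights, thetas))
variance = sum(w * (th - mean) ** 2 for w, th in zip(weights, thetas))

exact = mutual_information(thetas, weights)
approx = 0.5 * variance * fisher_information(mean)  # (1/2) Sigma_t times I_x(mean)
```

For this posterior the two values agree to well within one percent; the lemma bounds the discrepancy by a multiple of the third absolute central moment.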


For the next theorems and lemmas, we define the following conditions that depend on
a subset $U \subset \boldsymbol{\Theta}$:

L1. $|\nabla^2_\theta \log p(y_x \mid \theta) - \nabla^2_\theta \log p(y_x \mid \theta')| \le \mu/2$ for all $\theta, \theta' \in U$, $x \in X$, and $y_x$.

L2. $|\log p(\theta) - \log p(\theta')| \le C$ for all $\theta, \theta' \in U$.

L3. The maximum likelihood estimator $\Theta^*_t := \operatorname*{arg\,max}_{\theta \in U} p(\mathbf{Y}_t \mid \theta)$ is eventually well-defined
and converges to $\theta_0$ as $t$ increases within indices satisfying $\lambda_t \ge t\mu$, where
$\lambda_t$ is the smallest eigenvalue of $-\nabla^2_\theta \log p(\mathbf{Y}_t \mid \theta_0)$.


Lemma 3.2. Suppose that N4 and N5 hold. Then, for any $\mu, C > 0$, there exists a constant
$0 < \delta_{\mu,C} < \infty$ such that L1 and L2 hold for any neighborhood $U$ of $\theta_0$ having diameter
less than $\delta_{\mu,C}$.
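Lemma 3.3. Suppose that N1, N3, and L1 hold. If $p(\mathbf{Y}_t \mid \theta) \ge p(\mathbf{Y}_t \mid \theta_0)$ for some $\theta \in U$,
then

$$|\theta - \theta_0| \le \frac{4|\Delta_t|}{\mu t^{1/2}},$$

where $\Delta_t = t^{-1/2} \nabla_\theta \log p(\mathbf{Y}_t \mid \theta_0)$. Furthermore, conditioned on $\theta_0$ as the true parameter
value,

$$P\{|\Delta_t| \ge a\} \le 2n \exp\left( -\frac{a^2}{2nM^2} \right)$$

for all $t$ satisfying $\lambda_t \ge t\mu$, where $\lambda_t$ is the smallest eigenvalue of $-\nabla^2_\theta \log p(\mathbf{Y}_t \mid \theta_0)$.


Proof. Taylor’s theorem yields

$$\log p(\mathbf{Y}_t \mid \theta) = \log p(\mathbf{Y}_t \mid \theta_0) + \underbrace{\nabla_\theta \log p(\mathbf{Y}_t \mid \theta_0)^T}_{=: Z_t^T} (\theta - \theta_0) + \frac{1}{2} \underbrace{(\theta - \theta_0)^T \nabla^2_\theta \log p(\mathbf{Y}_t \mid \theta') (\theta - \theta_0)}_{\le -(1/2)\lambda_t|\theta - \theta_0|^2 \le -(1/2)t\mu|\theta - \theta_0|^2}$$

for some $\theta'$ between $\theta_0$ and $\theta$. Thus, $p(\mathbf{Y}_t \mid \theta) \ge p(\mathbf{Y}_t \mid \theta_0)$ implies $Z_t^T(\theta - \theta_0) \ge \tfrac{1}{4} t\mu|\theta - \theta_0|^2$,
which in turn implies $|Z_t| \ge \tfrac{1}{4} t\mu|\theta - \theta_0|$. This is equivalent to the first statement.

Let us then prove the latter statement. Now $|Z_t| t^{-1/2} = |\Delta_t| \ge a$ implies that $|Z_t^{(k)}| \ge t^{1/2} a/\sqrt{n}$
holds for at least one component $k \in \{1, \ldots, n\}$. But as each $Z_t^{(k)}$ is a martingale
satisfying $Z_0^{(k)} = 0$ and $|Z_{t+1}^{(k)} - Z_t^{(k)}| \le M$, Theorem A.4 yields

$$P\{|Z_t^{(k)}| \ge t^{1/2} a/\sqrt{n}\} \le 2 \exp\left( -\frac{a^2}{2nM^2} \right)$$

for all $k \in \{1, \ldots, n\}$. Summing these probabilities over $k$ so as to give an upper bound
on the probability that at least one component is over the limit gives the statement. □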
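
The tail bound in the last display has the Azuma-Hoeffding form $2\exp(-c^2/(2tM^2))$ for a martingale with increments bounded by $M$ and threshold $c = t^{1/2} a/\sqrt{n}$. Whether Theorem A.4 is stated in exactly this form is an assumption here. As a minimal sketch, the bound can be compared with the exact tail probability of the simplest such martingale, a sum of independent $\pm 1$ steps.

```python
import math


def two_sided_tail(t, c):
    """Exact P{|S_t| >= c} for S_t a sum of t independent, equally likely +/-1 steps."""
    count = 0
    for ups in range(t + 1):
        s = 2 * ups - t  # value of the walk after `ups` up-steps
        if abs(s) >= c:
            count += math.comb(t, ups)
    return count / 2 ** t


def azuma_hoeffding_bound(t, c, m=1.0):
    """Bound 2 exp(-c^2 / (2 t m^2)) for increments bounded by m."""
    return 2.0 * math.exp(-(c ** 2) / (2.0 * t * m ** 2))


# Compare the exact tail with the bound over a range of horizons and thresholds.
bound_holds = all(
    two_sided_tail(t, c) <= azuma_hoeffding_bound(t, c)
    for t in (10, 50, 200)
    for c in range(1, t + 1)
)
```

Setting $c = t^{1/2} a/\sqrt{n}$ makes the exponent equal to $-a^2/(2nM^2)$, which is independent of $t$, as in the statement of the lemma.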

Lemma 3.4. Suppose that N1-N3 and L1 hold. Then, L3 holds almost surely.

Proof. For any sufficiently small $\varepsilon > 0$, N2 implies that the set $V = B(\theta_0, \varepsilon)$ is a subset
of $\boldsymbol{\Theta}$. Lemma 3.3 applied to this set implies that $\Theta^*_t$ converges fast in probability to $\theta_0$,
that is, the probability $P\{\Theta^*_t \notin B(\theta_0, \varepsilon)\}$ sums to a finite value over all $t$. This implies
that $\Theta^*_t \overset{\text{a.s.}}{\longrightarrow} \theta_0$. □

Theorem 3.1 (Asymptotic normality). Suppose that N1-N5 hold and let L1-L3 hold
for some $\mu > 0$, $C > 0$, and $U \subset \boldsymbol{\Theta}$. Then, the following conditions surely hold when $t$
increases within indices satisfying $\lambda_t \ge t\mu$:

1. The posterior density of the scaled variable $\phi_t = t^{1/2}(\Theta - \Theta^*_t)$ satisfies

$$\int |p_t(\phi_t \mid \Theta \in U) - N(\phi_t; 0, B_t^{-1})|\,d\phi_t \to 0,$$

where $N(\,\cdot\,; m, S)$ denotes a normal density with given mean and covariance and
$B_t = -t^{-1} \nabla^2_\theta \log p(\mathbf{Y}_t \mid \theta_0)$.

2. All moments as well as the entropy of $p_t(\phi_t \mid \Theta \in U)$ are asymptotically equal to
those of $N(\phi_t; 0, B_t^{-1})$, that is, the difference converges to zero.

3. Adjusting for the $t^{1/2}$ scaling factor, this implies in particular that $t\,\mathrm{Cov}_t(\Theta \mid U) - B_t^{-1} \to 0$
and $t^{3/2} E_t(|\Theta - E_t(\Theta \mid U)|^3 \mid U) \le c_n \mu^{-3/2}$ for sufficiently large $t$ for some
constant $c_n$, and so (assuming that $U$ is bounded and convex), Lemma 3.1 yields

$$\sup_{x \in X} \left| t\,I_t(\Theta; Y_x \mid U) - \tfrac{1}{2} B_t^{-1} \odot I_x(\theta_0) \right| \to 0.$$


Proof. The scaled variable $\phi_t$ takes values in the set $V_t := \{\phi \in \mathbb{R}^n : \Theta^*_t + t^{-1/2}\phi \in U\}$.
A Taylor expansion of $\log p(\mathbf{Y}_t \mid \phi_t)$ at $\phi_t = 0$ yields

$$\frac{p_t(\phi_t)}{p_t(\phi_t = 0)} = \exp(\pm\varepsilon(r)) \exp\left( -\tfrac{1}{2} \phi_t^T B_t \phi_t \pm \tfrac{1}{2} \varepsilon(r) |\phi_t|^2 \right)$$

for all $\phi_t$ satisfying $\Theta^*_t + t^{-1/2}\phi_t \in B(\theta_0, r)$, where

$$\varepsilon(r) = \sup_{x,\, y_x,\, \theta, \theta' \in B(\theta_0, r)} \max\left\{ \left| \log \frac{p(\theta)}{p(\theta')} \right|,\ |\nabla^2_\theta \log p(y_x \mid \theta) - \nabla^2_\theta \log p(y_x \mid \theta')| \right\}.$$

Denoting $r_t = t^{1/4}$, we have $S_t := B(0, r_t) \subset V_t$ for sufficiently large $t$ and
$\varepsilon_t = \varepsilon(r_t t^{-1/2} + |\Theta^*_t - \theta_0|) \to 0$. It follows

$$p_t(\phi_t) \propto f_t(\phi_t) := \underbrace{\exp(-\tfrac{1}{2} \phi_t^T B_t \phi_t)}_{=: N_t(\phi_t)}\, g_t(\phi_t)$$

for all $\phi_t \in V_t$, where $g_t(\phi) = \exp(\pm\varepsilon_t \pm \tfrac{1}{2}\varepsilon_t |\phi|^2) \to 1$ for $\phi \in S_t$. As $N_t(\phi)$ is uniformly
bounded and $S_t \to \mathbb{R}^n$, it follows $[\phi \in V_t] f_t(\phi) - N_t(\phi) \to 0$ for all $\phi \in \mathbb{R}^n$. Furthermore,
as $N_t(\phi) \le \exp(-\tfrac{1}{2}\mu|\phi|^2)$ and $g_t(\phi) = \exp(\pm C \pm \tfrac{1}{4}\mu|\phi|^2)$ for all $\phi \in V_t$, it follows

$$\int [\phi \in V_t] f_t(\phi) |\phi|^k\,d\phi \le \int \exp(C - \tfrac{1}{4}\mu|\phi|^2) |\phi|^k\,d\phi < \infty, \qquad \int N_t(\phi) |\phi|^k\,d\phi < \infty$$

for all $k \ge 0$, and so Lebesgue’s dominated convergence theorem implies that

$$\int |[\phi \in V_t] f_t(\phi) u(\phi) - N_t(\phi) u(\phi)|\,d\phi \to 0$$

for any function $|u(\phi)| \le |\phi|^k$. This implies that all moments of $[\phi \in V_t] f_t(\phi)$ are asymptotically
equal to those of $N_t(\phi)$. As the eigenvalues of $B_t$ are between $\mu$ and $M$, the normalization
constant $Z := \int N_t(\phi)\,d\phi$ is within the constant range $[(2\pi/M)^{n/2}, (2\pi/\mu)^{n/2}]$,
and it follows that the moments of the normalized densities $p_t(\phi_t)$ and $N(\phi_t; 0, B_t^{-1})$
are also asymptotically equal. Similarly, as $f_t(\phi) \log f_t(\phi) - N_t(\phi) \log N_t(\phi) \to 0$, where
the log-factors can be bounded by polynomials of $|\phi|$, it follows that the entropies of
$p_t(\phi_t)$ and $N(\phi_t; 0, B_t^{-1})$ are asymptotically equal. (Note that the entropy of a density
$p(x) = f(x)/Z$ can be calculated as $-(\int f \log f)/Z + \log(Z)$.) □

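Lemma 3.5. Suppose that N1 and N3 hold. Then, conditioned on $\theta_0$ as the true parameter value, $\mathrm{E}(-\nabla_\theta^2 \log p(Y_x \mid \theta_0)) = I_x(\theta_0)$ for all $x \in X$, and

$$B_t - \frac{1}{t}\sum_{k=1}^{t} I_{X_k}(\theta_0) \xrightarrow{\text{a.s.}} 0,$$

where $B_t = -t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0)$.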

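Proof.

$$\begin{aligned}
\mathrm{E}(-\nabla_\theta^2 \log p(Y_x \mid \theta_0) \mid \Theta = \theta_0)
&= \int p(y_x \mid \theta_0)\left[\frac{\nabla_\theta p(y_x \mid \theta_0)}{p(y_x \mid \theta_0)}\,\frac{\nabla_\theta p(y_x \mid \theta_0)^T}{p(y_x \mid \theta_0)} - \frac{\nabla_\theta^2 p(y_x \mid \theta_0)}{p(y_x \mid \theta_0)}\right] dy_x \\
&= I_x(\theta_0) - \int \nabla_\theta^2 p(y_x \mid \theta_0)\, dy_x \\
&= I_x(\theta_0) - \nabla_\theta \int \nabla_\theta p(y_x \mid \theta_0)^T\, dy_x \\
&= I_x(\theta_0) - \nabla_\theta^2 \int p(y_x \mid \theta_0)\, dy_x = I_x(\theta_0),
\end{aligned}$$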


where the interchange of the order of integration and differentiation is justified by Lebesgue's dominated convergence theorem for the $dy_x$-integrable dominating functions $f_x(y_x)$ and $g_x(y_x)$ given by

$$\begin{aligned}
|\nabla_\theta^2 p(y_x \mid \theta)| &= p(y_x \mid \theta)\,\bigl|\nabla_\theta \log p(y_x \mid \theta)\,\nabla_\theta \log p(y_x \mid \theta)^T + \nabla_\theta^2 \log p(y_x \mid \theta)\bigr| \\
&\le p(y_x \mid \theta_0)\exp(M|\theta - \theta_0|)\cdot(M^2 + M) \\
&\le p(y_x \mid \theta_0)\exp(MD)\cdot(M^2 + M) =: f_x(y_x)
\end{aligned}$$

and

$$\begin{aligned}
|\nabla_\theta p(y_x \mid \theta)| &= p(y_x \mid \theta)\,|\nabla_\theta \log p(y_x \mid \theta)| \\
&\le p(y_x \mid \theta_0)\exp(MD)\cdot M =: g_x(y_x).
\end{aligned}$$

Thus, denoting $Z_k = -\nabla_\theta^2 \log p(Y_{X_k} \mid \theta_0) - I_{X_k}(\theta_0)$, given $\Theta = \theta_0$, the sequence $Z_1 + \cdots + Z_k$ of partial sums is a martingale and satisfies $\mathrm{E}(|Z_k|^2) \le (M + M)^2 < \infty$ for all $k$, and so Theorem A.3 implies that $(Z_1 + \cdots + Z_t)/t \xrightarrow{\text{a.s.}} 0$, which is the statement. □
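The identity $\mathrm{E}(-\nabla_\theta^2 \log p(Y_x \mid \theta_0)) = I_x(\theta_0)$ can be verified exactly in a small discrete model. The sketch below assumes a Bernoulli observation with $p(y \mid \theta) = \theta^y (1-\theta)^{1-y}$ (a hypothetical example, not a model from the paper) and compares the expected squared score with the expected negative second derivative of the log-likelihood.

```python
def bernoulli_information(theta):
    """Return (E[score^2], E[-d^2/dtheta^2 log p]) for
    p(y | theta) = theta^y * (1 - theta)^(1 - y), y in {0, 1}."""
    score_sq = 0.0
    neg_hessian = 0.0
    for y in (0, 1):
        prob = theta if y == 1 else 1.0 - theta
        # first and second derivative of log p(y | theta) w.r.t. theta
        score = y / theta - (1 - y) / (1.0 - theta)
        second = -y / theta**2 - (1 - y) / (1.0 - theta) ** 2
        score_sq += prob * score**2
        neg_hessian += prob * (-second)
    return score_sq, neg_hessian

fisher, mean_neg_hessian = bernoulli_information(0.3)
```

Both quantities equal $1/(\theta(1-\theta))$ in this model, which is the one-dimensional instance of the first claim of Lemma 3.5.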

Corollary 3.1. Suppose that N1-N5 hold. Then, for all $\mu > 0$, almost surely $t\Sigma_t > (B_t + \mu I)^{-1}$ (meaning that the difference is positive definite) for all sufficiently large $t$, where $B_t := -t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0)$. In particular, $\operatorname{tr}(t\Sigma_t) > (2\mu)^{-1}$ and $\det(t\Sigma_t) > (2\mu)^{-1}(2M)^{-(n-1)}$ for all sufficiently large $t$ satisfying $\min\lambda_{B_t} < \mu < M$, where $\min\lambda_{B_t}$ denotes the smallest eigenvalue of $B_t$.

Proof. Let $\mu > 0$ be arbitrary and define an augmented observation model $Y'_x := (Y_x, Z)$, where $Z \sim N(\Theta, \mu^{-1} I)$ is independent (given $\Theta$) from $Y_x$. Let $U$ be a neighborhood of $\theta_0$ satisfying L1 and L2 as well as L3 almost surely. If we choose the auxiliary component $z_t$ so as to obtain $t^{-1}\sum_{k=1}^{t} z_k = \mathrm{E}(\Theta \mid y_t)$ for each $t$, then L3 remains satisfied given the augmented data and we also obtain $\Sigma_t > \Sigma'_t$, because the augmented data will strictly decrease the square error from the original mean, and moving to the new mean can only further reduce this error. The normalized Hessian at $\theta_0$ for the augmented data is $B'_t = B_t + \mu I$, and so, due to Lemma 3.5, $\min\lambda_{B'_t} > \mu/2$ for all sufficiently large $t$ (although we have fiddled with the $z_k$ values, Lemma 3.5 still applies as it does not depend on these values). Thus, Theorem 3.1(3) implies that $t\operatorname{Cov}(\Theta \mid y'_t, U) - (B'_t)^{-1} \to 0$ (note that Theorem 3.1 is a sure result and hence applies even with our fiddled $z_k$ values). Since $p_t\{\Theta \in U^c\}$ decays exponentially in the augmented model, it follows that also $t\Sigma'_t - (B'_t)^{-1} \to 0$. As the eigenvalues of $B'_t$ are within the range $[\mu/2, M + \mu]$, the matrix inverse behaves nicely and we obtain $(t\Sigma'_t)^{-1} - B'_t \to 0$, which implies $(t\Sigma'_t)^{-1} - B'_t < \epsilon I$ for all sufficiently large $t$ for any $\epsilon > 0$. It follows $t\Sigma_t > t\Sigma'_t > (B_t + (\mu + \epsilon)I)^{-1}$ for all sufficiently large $t$. □


4. Asymptotic optimality 


In this section, we assume that: 

O1. C1-C4 hold globally.

O2. Some neighborhood $U_0$ of $\theta_0 \in \Theta$ is homeomorphic to a subset of $\mathbb{R}^n$ that satisfies N1-N5.

O3. There exist placements $x_1, \ldots, x_m \in X$ and nonnegative weights $\alpha_1 + \cdots + \alpha_m = 1$ such that $\sum_{j=1}^{m} \alpha_j I_{x_j}(\theta_0)$ is positive definite.

O4. The placements $X_t$ satisfy

$$R_t := \frac{I_t(\Theta; Y_{X_{t+1}})}{\sup_{x \in X} I_t(\Theta; Y_x)} \rightsquigarrow 1.$$

(See Section 4.1 below for the definition of "$\rightsquigarrow$".)
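The last condition above holds trivially (the ratio equals one) when each placement is chosen to maximize the expected information gain. The following is a minimal sketch of such a greedy step, assuming a discretized parameter grid, a binary outcome, and a hypothetical logistic response model; the names `success_prob`, `expected_information_gain`, `greedy_placement`, and `update` are illustrative and do not come from the paper.

```python
import math

def h2(q):
    """Binary entropy in nats."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

def success_prob(theta, x):
    """Hypothetical response model p(y = 1 | theta, x)."""
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def expected_information_gain(posterior, thetas, x):
    """Mutual information between Theta and a binary Y_x under the
    current posterior: H(Y_x) - E[H(Y_x | Theta)]."""
    probs = [success_prob(th, x) for th in thetas]
    marginal = sum(w * q for w, q in zip(posterior, probs))
    conditional = sum(w * h2(q) for w, q in zip(posterior, probs))
    return h2(marginal) - conditional

def greedy_placement(posterior, thetas, placements):
    """Placement with the largest expected information gain."""
    return max(placements, key=lambda x: expected_information_gain(posterior, thetas, x))

def update(posterior, thetas, x, y):
    """Bayes update of the grid posterior after observing y at placement x."""
    like = [success_prob(th, x) if y == 1 else 1.0 - success_prob(th, x) for th in thetas]
    unnorm = [w * l for w, l in zip(posterior, like)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

thetas = [i / 10 for i in range(-20, 21)]
prior = [1.0 / len(thetas)] * len(thetas)
placements = [-2.0, -1.0, 0.0, 1.0, 2.0]
chosen = greedy_placement(prior, thetas, placements)
gains = [expected_information_gain(prior, thetas, x) for x in placements]
posterior = update(prior, thetas, chosen, 1)
```

A grid posterior is used here only to keep the sketch self-contained; the theory in this section concerns continuous parameter spaces.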

First, let us say a few words about the main difficulty related to the adaptivity of the placements, namely the complications caused by any secondary modes in the posterior distribution. This issue is discussed by Paninski [10] in the context of consistency, but it seems that even after consistency has been established, the issue cannot be ignored.

The information maximization strategy decreases the relative weights of any secondary modes only at a rate approximately proportional to $1/t$ [10]. Therefore, any secondary mode may have a contribution proportional to $1/t$ to all moments of the posterior distribution. This means that only the first order moments of the approximating normal distribution remain asymptotically accurate, even though its total variation distance from the posterior does tend to zero. In particular, the inverse Hessian of the likelihood generally does not give an asymptotically accurate approximation of the global posterior covariance. (In fact, the global posterior covariance may be undefined as $\Theta$ need not have a global Euclidean structure.)

For this reason, the asymptotic approximation to the expected information gain $I_t(\Theta; Y_x \mid U)$ given by Theorem 3.1(3) only applies within a sufficiently small neighborhood $U$ of the true parameter value, where the posterior can be shown to be asymptotically unimodal. Nonetheless, even though the local and global moments are not in good agreement asymptotically, it turns out that $I_t(\Theta; Y_{X_{t+1}} \mid U)$ is in fact in good agreement with $I_t(\Theta; Y_{X_{t+1}})$ on "most trials". Indeed, as the relative weights of any secondary modes typically decay at an exponential rate with the number of trials whose placements can distinguish between them, it follows that the placements of only a decreasing fraction of trials can be significantly affected by the secondary modes.

To formalize this intuition, we will first develop a theory for measuring asymptotic 
proportions. 

4.1. Asymptotic proportions 

Definition 4.1. To measure subsets $K \subset \mathbb{N}$, we use the proportion measures

$$p(K) = \lim_{n \to \infty} p_{1,n}(K), \qquad p_{a,b}(K) = \frac{|K \cap [a, b[|}{b - a},$$

where $|\cdot|$ indicates the cardinality of a set. (Note that although $p_{a,b}$ is a measure in the measure-theoretic sense for any $a, b \in \mathbb{N}$, the limit $p$ is only a finitely additive measure.) When we say "for almost every $n \in \mathbb{N}$", we mean that the set where the statement does not hold is a null set w.r.t. $p$. We use the notation $x_k \rightsquigarrow x$ to mean that there exists a subset $K \subset \mathbb{N}$ with $p(K) = 1$ such that $[k \in K](x_k - x) \to 0$. We also define

$$\limsup_{k \rightsquigarrow \infty} x_k := \inf\{x \in \mathbb{R} : x_k < x \text{ for a.e. } k \in \mathbb{N}\},$$
$$\liminf_{k \rightsquigarrow \infty} x_k := \sup\{x \in \mathbb{R} : x_k > x \text{ for a.e. } k \in \mathbb{N}\},$$

and when both equal $x$, we write $\lim_{k \rightsquigarrow \infty} x_k = x$.
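As a small illustration of convergence on "almost every" index, consider a sequence that equals 1 on the perfect squares and $1/k$ elsewhere: it does not converge in the ordinary sense, but the exceptional index set has asymptotic proportion zero. The sketch below assumes the proportion is computed over half-open integer ranges $[a, b)$; the names `proportion`, `is_square`, and `x` are hypothetical.

```python
import math

def is_square(k):
    """True if k is a perfect square."""
    r = math.isqrt(k)
    return r * r == k

def proportion(indicator, a, b):
    """Fraction of the integers in [a, b) that belong to the set K
    described by the indicator function."""
    return sum(1 for k in range(a, b) if indicator(k)) / (b - a)

def x(k):
    """Sequence that equals 1 on squares and 1/k elsewhere."""
    return 1.0 if is_square(k) else 1.0 / k

# proportion of exceptional indices among the first n integers
exceptional = [proportion(is_square, 1, n + 1) for n in (100, 10_000, 1_000_000)]
```

The proportions decrease like $1/\sqrt{n}$, so the sequence converges to 0 along a set of indices of proportion one even though $x_k = 1$ infinitely often.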

Lemma 4.1. Suppose that for all $j \in \mathbb{N}$, the proposition $P_k^j$ holds for a.e. $k \in \mathbb{N}$. Then there exists an increasing sequence $j(k) \to \infty$ such that $P_k^1 \wedge \cdots \wedge P_k^{j(k)}$ holds for a.e. $k \in \mathbb{N}$.

Proof. For all $j \in \mathbb{N}$, $Q_k^j := P_k^1 \wedge \cdots \wedge P_k^j$ holds for a.e. $k \in \mathbb{N}$. Thus, for all $j \in \mathbb{N}$,

$$f_j(k) := \inf_{k' > k} \frac{\sum_{i=1}^{k'} [Q_i^j]}{k'}$$

is increasing in $k$ and tends to one as $k \to \infty$. Choosing

$$j(k) = \max\{j' \in \mathbb{N} : f_{j'}(k) > 1 - 1/j'\}$$

yields the statement. □

Lemma 4.2. If $x_k$ is a bounded sequence, then the following are equivalent:

1. $x_k \rightsquigarrow x$,

2. $|x_k - x| < \epsilon$ for a.e. $k \in \mathbb{N}$ for all $\epsilon > 0$,

3. $\lim_{k \rightsquigarrow \infty} x_k = x$,

4. $\frac{1}{n}\sum_{k=1}^{n} |x_k - x| \to 0$.

If $x_k$ is not bounded, then 1-3 are equivalent and implied by 4.

Proof. All implications are fairly obvious. As an example, "2 ⇒ 1" follows from Lemma 4.1 applied to $P_k^j = [|x_k - x| < 1/j]$. □

Lemma 4.3. Let $x_k$ be a nonnegative sequence. If $\sum_{k=1}^{\infty} x_k < \infty$, then for any $\epsilon > 0$, the inequality $x_k < \epsilon/k$ holds true for almost every $k \in \mathbb{N}$ (which implies $k \cdot x_k \rightsquigarrow 0$).

Proof. Assume the contrary: for some $\epsilon > 0$ there exists a set $K \subset \mathbb{N}$ such that $x_k \ge \epsilon/k$ for all $k \in K$ and for some $c > 0$, $p_{1,k}(K) > c$ for arbitrarily large $k$. As $p_{1,n+1}(K) - p_{k,n+k}(K) < 2k/n \to 0$ as $n \to \infty$ for all $k$, we can recursively find an increasing sequence of indices $k_1 = 1$, $k_{i+1} \ge 2k_i$, such that $p_{k_i,k_{i+1}}(K) > c$ for all $i$. This yields

$$\sum_{k=1}^{\infty} x_k \ge \sum_{i=1}^{\infty} c\,(k_{i+1} - k_i)\frac{\epsilon}{k_{i+1}} \ge \sum_{i=1}^{\infty} c\,(2k_i - k_i)\frac{\epsilon}{2k_i} = \infty,$$

which contradicts the assumption. □


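Lemma 4.4. Suppose that a sequence of random variables $X_k : \Omega \to [-M, M]$ satisfies $X_k \rightsquigarrow X$ almost surely. Then, $\mathrm{E}(|X_k - X|) \rightsquigarrow 0$.

Proof. By Lemma 4.2(4) and the dominated convergence theorem,

$$\frac{1}{t}\sum_{k=1}^{t} \mathrm{E}(|X_k - X|) = \mathrm{E}\left(\frac{1}{t}\sum_{k=1}^{t} |X_k - X|\right) \to \mathrm{E}\left(\lim_{t \to \infty}\frac{1}{t}\sum_{k=1}^{t} |X_k - X|\right) = 0. \qquad \square$$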


Corollary 4.1. Suppose that the event $A_k$ happens for a.e. $k \in \mathbb{N}$ a.s. Then, $P\{A_k\} \rightsquigarrow 1$.

Definition 4.2. We use the notation $X_k \overset{P}{\rightsquigarrow} X$ to mean that there exists a subset $K \subset \mathbb{N}$ with $p(K) = 1$ such that $[k \in K](X_k - X) \overset{P}{\to} 0$.

Lemma 4.5. $X_k \overset{P}{\rightsquigarrow} X$ if and only if $P\{|X_k - X| > \epsilon\} \rightsquigarrow 0$ for all $\epsilon > 0$.

Proof. The "only if" direction is obvious. We will prove the "if" direction.

By definition, we have $P\{|X_k - X| > 1/j\} < 1/j$ for a.e. $k \in \mathbb{N}$ for all $j \in \mathbb{N}$. Lemma 4.1 then implies that there exists an increasing sequence $j(k) \to \infty$ such that

$$P\{|X_k - X| > 1/j(k)\} < 1/j(k) \to 0$$

for a.e. $k \in \mathbb{N}$. □

Lemma 4.6. Suppose that a sequence of random variables $X_k$ satisfies $X_k \rightsquigarrow X$ almost surely. Then, $X_k \overset{P}{\rightsquigarrow} X$.

Proof. Let $\epsilon > 0$ be arbitrary. Denoting

$$Y_t = \frac{1}{t}\sum_{k=1}^{t} [|X_k - X| > \epsilon],$$

$X_k \rightsquigarrow X$ implies that $Y_t \to 0$. As $Y_t$ is bounded, the dominated convergence theorem implies

$$0 = \mathrm{E}\left(\lim_{t \to \infty} Y_t\right) = \lim_{t \to \infty} \mathrm{E}(Y_t) = \lim_{t \to \infty} \frac{1}{t}\sum_{k=1}^{t} P\{|X_k - X| > \epsilon\}$$

and so Lemma 4.2(4) yields $P\{|X_k - X| > \epsilon\} \rightsquigarrow 0$. Now Lemma 4.5 implies the statement. □

4.2. Asymptotic D-optimality 

In this section, we show that the greedy information maximization strategy satisfies asymptotically a condition known as D-optimality. This condition is defined as maximality of the determinant of the Fisher information matrix of the experiment at the true parameter value $\theta_0$. The D-optimality criterion is special among all functionals of the information matrix (such as the trace, minimum eigenvalue, etc.) in that it is insensitive to linear or affine transformations of the parameter space $\Theta$. Furthermore, in the asymptotically normal models that we are interested in, it yields a (local) approximation of the posterior entropy, which is the utility function commonly used in adaptive estimation settings. We will make use of this fact in the next section to derive an asymptotic expression of the posterior entropy.

Lemma 4.7. For almost any $\theta_0 \in \Theta$ satisfying O1-O3, there exists a constant $c$ such that for all $\mu > 0$, given $\theta_0$ as the true parameter value, almost surely $I_t(\Theta; Y_{X_{t+1}}) > c(t\mu)^{-1}$ for all sufficiently large $t$ satisfying $\lambda_t < t\mu$, where $\lambda_t$ denotes the smallest eigenvalue of $-\nabla_\theta^2 \log p(Y_t \mid \theta_0)$.


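Proof. Denoting $I := \sum_{j=1}^{m} \alpha_j I_j$, where $\alpha_j$ and $I_j := I_{x_j}(\theta_0)$ are given by O3, the smallest eigenvalue $\min\lambda_I$ is positive.

Suppose that $U_0$ has diameter $D$ and let $C_{M,D}$ be the constant of Lemma 3.1 applied to $U_0$ as the parameter space. The same constant also applies to any subset $U = B(\theta_0, \delta/2) \subset U_0$ with diameter $\delta < D$ and as the posteriors are strongly consistent in $U$, too, Lemma 2.5 implies that $\mathrm{E}_t(\Theta \mid U) \xrightarrow{\text{a.s.}} \theta_0$. Thus, N3 and N4 imply that $|I_x(\mathrm{E}_t(\Theta \mid U)) - I_x(\theta_0)| < \delta$ for all $x$ for all sufficiently large $t$. We obtain

$$\begin{aligned}
I_t(\Theta; Y_x \mid U)
&\ge \tfrac{1}{2}\operatorname{Cov}_t(\Theta \mid U) \odot I_x(\mathrm{E}_t(\Theta \mid U)) - C_{M,D}\,\mathrm{E}_t(|\Theta - \mathrm{E}_t(\Theta \mid U)|^3 \mid U) \\
&\ge \tfrac{1}{2}\operatorname{Cov}_t(\Theta \mid U) \odot I_x(\mathrm{E}_t(\Theta \mid U)) - C_{M,D}\,\delta\,\mathrm{E}_t(|\Theta - \mathrm{E}_t(\Theta \mid U)|^2 \mid U) \\
&= \tfrac{1}{2}\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)\, I_x(\mathrm{E}_t(\Theta \mid U))) - C_{M,D}\,\delta\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \\
&\ge \tfrac{1}{2}\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)\, I_x(\theta_0)) - \left(C_{M,D} + \tfrac{1}{2}\right)\delta\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \\
&\ge \tfrac{1}{2}\max_{j=1,\ldots,m}\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)\, I_j) - \left(C_{M,D} + \tfrac{1}{2}\right)\delta\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \\
&\ge \tfrac{1}{2}\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)\, I) - \left(C_{M,D} + \tfrac{1}{2}\right)\delta\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \\
&\ge \tfrac{1}{2}\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U))\min\lambda_I - \left(C_{M,D} + \tfrac{1}{2}\right)\delta\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \\
&= \left(\frac{\min\lambda_I}{2} - \left(C_{M,D} + \tfrac{1}{2}\right)\delta\right)\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) =: c\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U))
\end{aligned}$$

for some $x \in X$ (fourth inequality) for all sufficiently large $t$ (third inequality), where we have used the fact that $\operatorname{tr}(A)\min\lambda_B \le \operatorname{tr}(AB) \le \operatorname{tr}(A)\max\lambda_B$ (sixth and third inequalities). Let us then choose $\delta < \min\lambda_I/(2C_{M,D} + 1)$ so that $c$ as defined above is positive. Now, the inequality $I_t(\Theta; Y_x) \ge p_t(U)\, I_t(\Theta; Y_x \mid U)$, which follows from the chain rule of mutual information (cf. the proof of the next lemma), and C4 + Corollary 3.1 imply

$$\begin{aligned}
I_t(\Theta; Y_{t+1}) &\ge \gamma \sup_{x} I_t(\Theta; Y_x) \ge \gamma \sup_{x} p_t(U)\, I_t(\Theta; Y_x \mid U) \\
&\ge \gamma\, p_t(U)\, c\operatorname{tr}(\operatorname{Cov}_t(\Theta \mid U)) \ge \gamma\, p_t(U)\, c\,(2t\mu)^{-1}.
\end{aligned}$$

As Lemma 2.5 yields $p_t(U) \xrightarrow{\text{a.s.}} 1$, the statement follows. □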





Lemma 4.8. For almost any $\theta_0 \in \Theta$ satisfying O1-O3, there exists a neighborhood $U \subset U_0$ of $\theta_0$ such that conditioned on $\theta_0$ as the true parameter value, almost surely,

$$Q_t := \frac{I_t(\Theta; Y_{X_{t+1}} \mid U)}{I_t(\Theta; Y_{X_{t+1}})} \rightsquigarrow 1.$$

Proof. By Lemmas 2.2, 2.3 and 4.3, almost surely, the convergences

$$I_t(\Theta; Y_{X_{t+1}} \mid U^c) \to 0,$$
$$t\, I_t([\Theta \in U]; Y_{X_{t+1}}) \rightsquigarrow 0$$

hold for all neighborhoods $U$ in a countable basis of the compact metrizable space $\Theta$. It follows that the same is true conditioned on almost any $\theta_0 \in \Theta$ as the true parameter value. Thus, given almost any $\theta_0 \in \Theta$, we can pick a neighborhood $U \subset U_0$ of $\theta_0$ from the countable basis such that the above convergences almost surely hold.

Lemma 4.7 (applied to $\mu = M$) almost surely yields

$$I_t(\Theta; Y_{t+1}) > c(Mt)^{-1} =: c_1 t^{-1}$$

for all sufficiently large $t$, where we denote $Y_{t+1} = Y_{X_{t+1}}$. Condition C4 + Lemma 2.5 yields

$$I_t(\Theta; Y_{t+1}) \ge \gamma \sup_{x} I_t(\Theta; Y_x) \ge \gamma c\, p_t(U^c) =: c_2\, p_t(U^c)$$

for all sufficiently large $t$, and the chain rule of mutual information yields

$$I_t(\Theta; Y_{t+1}) = I_t([\Theta \in U]; Y_{t+1}) + p_t(U)\, I_t(\Theta; Y_{t+1} \mid U) + p_t(U^c)\, I_t(\Theta; Y_{t+1} \mid U^c).$$

Thus, almost surely,

$$\frac{I_t(\Theta; Y_{t+1} \mid U)}{I_t(\Theta; Y_{t+1})} = \frac{1}{p_t(U)}\left[1 - \frac{\overbrace{p_t(U^c)}^{\le I_t(\Theta; Y_{t+1})/c_2}\;\overbrace{I_t(\Theta; Y_{t+1} \mid U^c)}^{\to 0} + \overbrace{t\, I_t([\Theta \in U]; Y_{t+1})}^{\rightsquigarrow 0}\, t^{-1}}{\underbrace{I_t(\Theta; Y_{t+1})}_{\ge c_1 t^{-1}}}\right] \rightsquigarrow 1. \qquad \square$$


Corollary 4.2. Conditioned on almost any $\theta_0$ satisfying O1-O4, the sequence

$$D_t := \sup_{x \in X} B_t^{-1} \odot I_x(\theta_0) - B_t^{-1} \odot I_{X_{t+1}}(\theta_0)$$

satisfies $[\min\lambda_{B_t} > \mu]\, D_t \rightsquigarrow 0$ a.s. for any given $\mu > 0$, where $\min\lambda_{B_t}$ denotes the smallest eigenvalue of $B_t := -t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0)$.


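Proof. Let us first shrink the neighborhood $U_0$ of $\theta_0$ as necessary to make its diameter smaller than the constant $\delta_{\mu,\epsilon}$ given by Lemma 3.2. Then, let $U \subset U_0$ be the neighborhood of $\theta_0$ given by Lemma 4.8. By Theorem 3.1(3), there now exist random sequences $E_t \to 0$ and $E'_t \to 0$ such that conditioned on $\theta_0$ as the true value,

$$\tfrac{1}{2}\sup_{x \in X} B_t^{-1} \odot I_x(\theta_0) = \sup_{x \in X} t\, I_t(\Theta; Y_x \mid U) + E_t,$$
$$\tfrac{1}{2} B_t^{-1} \odot I_{X_{t+1}}(\theta_0) = t\, I_t(\Theta; Y_{X_{t+1}} \mid U) + E'_t$$

whenever $\min\lambda_{B_t} > \mu$. For these $t$, it follows

$$\tfrac{1}{2} D_t = \Bigl(\tfrac{1}{2}\sup_{x \in X} \underbrace{B_t^{-1} \odot I_x(\theta_0)}_{= \operatorname{tr}(B_t^{-1} I_x(\theta_0)) \le n\mu^{-1}M} - E_t\Bigr)\left(1 - \frac{I_t(\Theta; Y_{X_{t+1}} \mid U)}{\sup_{x \in X} I_t(\Theta; Y_x \mid U)}\right) + E_t - E'_t,$$

where Lemma 4.8 and the inequality $I_t(\Theta; Y_x) \ge p_t(U)\, I_t(\Theta; Y_x \mid U)$ yield

$$\frac{I_t(\Theta; Y_{X_{t+1}} \mid U)}{\sup_{x \in X} I_t(\Theta; Y_x \mid U)} \ge p_t(U)\,\frac{I_t(\Theta; Y_{X_{t+1}} \mid U)}{\sup_{x \in X} I_t(\Theta; Y_x)} = p_t(U)\, Q_t R_t \rightsquigarrow 1,$$

and so $[\min\lambda_{B_t} > \mu]\, D_t \rightsquigarrow 0$. □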


Lemma 4.9. Conditioned on almost any $\theta_0$ satisfying O1-O3, there exists $\mu$ such that $\min\lambda_{B_t} > \mu$ for infinitely many $t \in \mathbb{N}$, where $\min\lambda_{B_t}$ denotes the smallest eigenvalue of $B_t = -t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0)$.

Proof. Let $\mu > 0$ be arbitrary. Lemma 4.7 almost surely yields $I_{t-1}(\Theta; Y_{X_t}) > c(t\mu)^{-1}$ for all sufficiently large $t$ satisfying $\min\lambda_{B_t} < \mu$ and Lemma 4.8 implies that $I_{t-1}(\Theta; Y_{X_t} \mid U_0) > c(t\mu)^{-1}$ for a.e. $t$ satisfying $\min\lambda_{B_t} < \mu$. Let then $K_\mu := \{t \in \mathbb{N} : \min\lambda_{B_t} > \mu\}$ and suppose that $p(K_\mu) = 0$. Then, $p_j := p_{2^j, 2^{j+1}}(K_\mu) \to 0$, and there exists $j_0$ such that $p_j < 1/2$ for all $j > j_0$. It follows


2 J1 —1 


2 J1 — 1 


j 1-1 2 i+1 -l 


i=l ^ t=l r j=lo t=2a(l+pj) ^ ' 


and so 

I fc -i(e;Yx fc | C/ 0 ) > (^log|) log 2 (t- 1)-^ 

for all $t = 2^{j_1}$, $j_1 > j_0$. Since $\mu$ was arbitrary, this implies that the sum grows asymptotically superlogarithmically if $p(K_\mu) = 0$ holds for all $\mu > 0$. If this event has positive probability among all $\theta_0 \in U_0$, then also

$$I(\Theta; Y_t \mid U_0) = \mathrm{E}\left(\sum_{k=1}^{t} I_{k-1}(\Theta; Y_{X_k} \mid U_0)\right)$$

grows superlogarithmically, contradicting Lemma 2.9. Thus, for almost all $\theta_0 \in U_0$ satisfying O1-O3, either $K_\mu$ is not $p$-measurable or $p(K_\mu) > 0$. In either case $K_\mu$ is infinite. □

Theorem 4.1 (Asymptotic D-optimality, part 1). Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O4, almost surely,

$$B_t := -t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0) \to B^* := \arg\max_{B \in \mathcal{I}} \det(B),$$

where $\mathcal{I}$ is the convex hull of the closure of $\{I_x(\theta_0)\}_{x \in X}$. The maximizer $B^*$ is unique, because the determinant is log-concave on the compact convex set $\mathcal{I}$. This result is optimal in the sense that for any strategy of choosing the placements $X_t$ (instead of O4 and C4), almost surely $\limsup_{t \to \infty} \det(B_t) \le \det(B^*)$.
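To make the limit $B^*$ concrete, the following sketch runs a deterministic idealization of the greedy rule: at each step it adds the information matrix that maximizes $B_t^{-1} \odot I_x$, in the spirit of classical vertex-direction methods for D-optimal design. It assumes three hypothetical $2 \times 2$ Fisher information matrices and replaces the random observed Hessians by the Fisher information itself; the function names are illustrative only.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def frob(a, b):
    """Frobenius inner product tr(a^T b) of two 2x2 matrices."""
    return sum(a[i][j] * b[i][j] for i in range(2) for j in range(2))

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def greedy_d_optimal(infos, steps):
    """Average information matrix after greedily adding, at each step,
    the candidate maximizing the Frobenius product with the inverse
    of the current average."""
    # start with one observation of each placement so the average is invertible
    total = [[sum(m[i][j] for m in infos) for j in range(2)] for i in range(2)]
    t = len(infos)
    for _ in range(steps):
        b_t = tuple(tuple(v / t for v in row) for row in total)
        b_inv = inv2(b_t)
        best = max(infos, key=lambda m: frob(b_inv, m))
        for i in range(2):
            for j in range(2):
                total[i][j] += best[i][j]
        t += 1
    return tuple(tuple(v / t for v in row) for row in total)

# hypothetical Fisher information matrices of three placements
infos = [((1.0, 0.0), (0.0, 0.0)),
         ((0.0, 0.0), (0.0, 1.0)),
         ((0.4, 0.0), (0.0, 0.4))]
b_final = greedy_d_optimal(infos, 2000)
```

In this example the determinant over the convex hull is maximized by the equal mixture of the first two matrices, with determinant $1/4$, and the greedy average approaches it as the number of steps grows.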


Proof. The objective function is

$$f(B) = \begin{cases} \log\det(B), & \min\lambda_B > 0, \\ -\infty, & \text{otherwise}, \end{cases}$$

where $\lambda_B$ denotes the set of eigenvalues of $B$. Lemma 3.5 implies that $B_t$ is asymptotically a convex combination of matrices in the closure of $\{I_x(\theta_0)\}_{x \in X}$ and so $\limsup_{t \to \infty} f(B_t) \le f(B^*)$. Let us then show that this upper bound is tight.

First, we choose some representation $B^* = \sum_k \alpha_k I_k$ of the optimum point, where $I_k$ are matrices in the closure of $\{I_x(\theta_0)\}_{x \in X}$ and $\sum_k \alpha_k = 1$.

For any symmetric real matrix B t , we have (with slight abuse of notation) 


Vf(B t ) = B r\ 

72«di_ iye>-1\ 


V^f(B t ) = -[(B-%(B-^}l j , 

[X 2 f(B t )]B = ~[(B-\(B~^ Q B}^ = ~B~ l BB~\ 

B © [V 2 f(B t )}B = - tr (B^BB^B), 
and Taylor’s theorem yields 

f{B t+ i) = f(B t ) + Bi 1 0 {B t+1 - B t ) - i tr {B^B'B^B'), 
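where $B'$ is between $0$ and $B_{t+1} - B_t$. Denoting $B := -\nabla_\theta^2 \log p(Y_{X_{t+1}} \mid \theta_0)$, we obtain

$$\begin{aligned}
f(B_{t+1}) - f(B_t) &= f\left(\frac{tB_t + B}{t+1}\right) - f(B_t) \\
&= B_t^{-1} \odot \frac{B - B_t}{t+1} - \frac{1}{2}\underbrace{\operatorname{tr}(B_t^{-1}B'B_t^{-1}B')}_{|\cdot| \le 4nM^2\mu^{-2}(t+1)^{-2}} \\
&\ge \frac{1}{t+1}\left(B_t^{-1} \odot B - n - \frac{2nM^2\mu^{-2}}{t+1}\right)
\end{aligned}$$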
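for all indices $t$ satisfying $\min\lambda_{B_t} > \mu$ for any $\mu > 0$. Denoting by $\lambda_i$ the eigenvalues of $B_t^{-1}B^*$, Corollary 4.2 now implies that

$$\begin{aligned}
B_t^{-1} \odot I_{X_{t+1}}(\theta_0) + D_t &= \sup_{x \in X} B_t^{-1} \odot I_x(\theta_0) \\
&\ge \max_k B_t^{-1} \odot I_k \ge \sum_k \alpha_k (B_t^{-1} \odot I_k) = B_t^{-1} \odot B^* \\
&= \operatorname{tr}(B_t^{-1}B^*) = \sum_{i=1}^{n} \lambda_i = n + \sum_{i=1}^{n} (\lambda_i - 1) \ge n + \sum_{i=1}^{n} \log(\lambda_i) \\
&= n + \log\det(B_t^{-1}B^*) = n + f(B^*) - f(B_t),
\end{aligned}$$

where $[\min\lambda_{B_t} > \mu]\, D_t \rightsquigarrow 0$ for any $\mu > 0$. Noting that $I_{X_{t+1}}(\theta_0) = \mathrm{E}_t(B \mid \theta_0)$, we obtain

$$\mathrm{E}_t(f(B_{t+1}) \mid \theta_0) - f(B_t) \ge \frac{1}{t+1}\bigl(f(B^*) - f(B_t) - D_{\mu,t}\bigr),$$

where $D_{\mu,t} = D_t + (2nM^2\mu^{-2})/(t+1)$.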

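From now on, in order to keep the notation clean, we will implicitly condition all probability statements on $\Theta = \theta_0$.

Let the constants $f_0 < f_1 < f(B^*)$ be arbitrary and define $\mu := \exp(f_0)M^{1-n}/2 > 0$. Suppose that some $t_0$ satisfies $f(B_{t_0}) > f_0$. Then, the definition of $\mu$ guarantees that $\min\lambda_{B_{t_0}} > 2\mu$. Let then $\alpha \in\, ]1, \exp(\mu/M)[$ be arbitrary. Since $\min\lambda_{B_t}$ can decrease by at most $M/t$ per each step, we obtain

$$\min\lambda_{B_t} > 2\mu - \sum_{t'=t_0+1}^{t} \frac{M}{t'} > 2\mu - M\log\frac{t}{t_0} > \mu$$

for all $t$ between $t_0$ and $t_1 := \lfloor \alpha t_0 \rfloor$. Thus, the following inequalities hold true for all $t \in [t_0, t_1[$:

$$\begin{aligned}
\mathrm{E}_{t-1} f(B_t) - f(B_{t-1}) &\ge \frac{1}{t}\bigl(f(B^*) - f(B_{t-1}) - D_{\mu,t-1}\bigr), \\
\mathrm{E}_{t-1}\bigl(t f(B_t) - (t-1) f(B_{t-1})\bigr) &\ge f(B^*) - D_{\mu,t-1}, \\
\mathrm{E}_{t_0}\bigl(t f(B_t) - (t-1) f(B_{t-1})\bigr) &\ge f(B^*) - \mathrm{E}_{t_0} D_{\mu,t-1}, \\
\sum_{t=t_0+1}^{t_1} \mathrm{E}_{t_0}\bigl(t f(B_t) - (t-1) f(B_{t-1})\bigr) &\ge \sum_{t=t_0+1}^{t_1} \bigl(f(B^*) - \mathrm{E}_{t_0} D_{\mu,t-1}\bigr), \\
\mathrm{E}_{t_0}\bigl(t_1 f(B_{t_1})\bigr) - t_0 f(B_{t_0}) &\ge (t_1 - t_0) f(B^*) - \sum_{t=t_0}^{t_1-1} \mathrm{E}_{t_0} D_{\mu,t},
\end{aligned}$$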
and dividing by $t_1$, we obtain the inequality

$$\mathrm{E}_{t_0} f(B_{t_1}) - \alpha^{-1} f(B_{t_0}) \ge \left(1 - \frac{t_0}{t_1}\right) f(B^*) - \mathrm{E}_{t_0}\left(\frac{1}{t_1}\sum_{t=t_0}^{t_1-1} D_{\mu,t}\right) \to (1 - \alpha^{-1}) f(B^*),$$

where we have used the fact that $t_1 \le \alpha t_0$, and where the convergence holds for any increasing sequence of indices $t_0$ satisfying $f(B_{t_0}) > f_0$ (which implies $\min\lambda_{B_t} > \mu$ for all $t \in [t_0, t_1[$). This convergence is obtained by applying Lemma 4.2(3) to the bounded sequence $[\min\lambda_{B_t} > \mu]\, D_{\mu,t} \rightsquigarrow 0$, which yields

$$\left|\frac{1}{t_1}\sum_{t=t_0}^{t_1-1} D_{\mu,t}\right| \le \frac{1}{t_1}\sum_{t=1}^{t_1-1} \bigl|[\min\lambda_{B_t} > \mu]\, D_{\mu,t}\bigr| \to 0$$

(and since $|D_{\mu,t}| \le 2nM\mu^{-1} + 2nM^2\mu^{-2}$ for all $t$, Lebesgue's dominated convergence theorem allows us to take this limit inside the expectation). Thus, there exists a positive constant $s$ such that

$$\mathrm{E}_{t_0} f(B_{t_1}) \ge f(B_{t_0}) + 2s$$

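for all sufficiently large $t_0$ satisfying $f_0 < f(B_{t_0}) < f_1$. Also, since the maximum change in the value of $f$ over one step is bounded by $v/t$ for some constant $v > 0$ (depending on $\mu$), we obtain

$$\operatorname{Var}_{t_0} f(B_{t_1}) \le \sum_{t=t_0+1}^{t_1} \frac{v^2}{t^2} \le v^2\left(\frac{1}{t_0} - \frac{1}{t_1}\right) < \frac{v^2}{t_0}.$$

Now Markov's inequality yields

$$\begin{aligned}
P_{t_0}\{f(B_{t_1}) < f(B_{t_0}) + s\} &\le P_{t_0}\{f(B_{t_1}) < \mathrm{E}_{t_0} f(B_{t_1}) - s\} \\
&\le P_{t_0}\{|\mathrm{E}_{t_0} f(B_{t_1}) - f(B_{t_1})|^2 > s^2\} \\
&\le \frac{\operatorname{Var}_{t_0} f(B_{t_1})}{s^2} \le \frac{v^2}{t_0 s^2}.
\end{aligned}$$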

As this upper bound on the probability sums to a finite number over the sequence $t_0(k)$ determined by $t_0(k+1) = t_1(k) = \lfloor \alpha t_0(k) \rfloor$, the Borel-Cantelli lemma implies that almost surely $f(B_{t_0(k+1)}) < f(B_{t_0(k)}) + s$ holds for only finitely many indices $k \in \mathbb{N}$ satisfying $f_0 < f(B_{t_0(k)}) < f_1$. Thus, there exists $k_0$ such that for all $k > k_0$, whenever $f_0 < f(B_{t_0(k)}) < f_1$, the value $f(B_{t_0(k)})$ will increase by at least $s$ on each step as $k$ increases. Furthermore, since

$$|f(B_t) - f(B_{t_0(k)})| \le \sum_{t'=t_0(k)+1}^{t_1(k)} \frac{v}{t'} \le v\log\frac{t_1(k)}{t_0(k)} \le v\log\alpha$$

for all $t \in [t_0(k), t_1(k)[$, it follows that if $f(B_{t_0(k)}) > f_0$ for any $k > k_0$, then $f(B_t) > f_1 - v\log\alpha$ for all sufficiently large $t$ (provided that $f_1 - v\log\alpha > f_0$). Since $f_1 - v\log\alpha$ can be made arbitrarily close to $f(B^*)$ by appropriate choices of rational $\alpha > 1$ and rational $f_1 < f(B^*)$ for arbitrarily small rational $f_0$, we almost surely obtain $\liminf_{t \to \infty} f(B_t) \ge f(B^*)$ unless $f(B_t)$ eventually stays below any number. But this would imply that $\limsup_{t \to \infty} \min\lambda_{B_t} \le 0$, which is almost surely contradicted by Lemma 4.9. □

Corollary 4.3 (Asymptotic D-optimality, part 2). Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O4, there exists a neighborhood $U$ of $\theta_0$ such that $t\operatorname{Cov}_t(\Theta \mid U) \xrightarrow{\text{a.s.}} (B^*)^{-1}$. This is optimal in the sense that for any other strategy in place of O4 and C4, almost surely $\liminf_{t \to \infty} \det(t\operatorname{Cov}_t(\Theta \mid U)) \ge \det(B^*)^{-1}$.

Proof. Given O4, Theorems 4.1 and 3.1(2) imply that $t\operatorname{Cov}_t(\Theta \mid U) \to (B^*)^{-1}$. For any other strategy, we have $\limsup_{t \to \infty} \det(B_t) \le \det(B^*)$ a.s., and so Theorem 3.1(2) yields $\liminf_{t \to \infty} \det(t\operatorname{Cov}_t(\Theta \mid U)) \ge \det(B^*)^{-1}$ a.s. as $t$ increases within indices satisfying $\min\lambda_{B_t} > \mu$ for some given $\mu > 0$. But Corollary 3.1 implies that if we choose a sufficiently small $\mu > 0$, then $\det(t\operatorname{Cov}_t(\Theta \mid U)) \ge \det(B^*)^{-1}$ also for $\min\lambda_{B_t} < \mu$, and the statement follows. □

Remark 4.1. As discussed in the beginning of this section, secondary modes with weights proportional to $1/t$ may remain outside $U$, and they do contribute to the asymptotic variance. Thus, the D-optimality result (part 2) shown here is only a local form of optimality.

The situation would be different if the placements were chosen so as to minimize the determinant of the posterior covariance $\operatorname{Cov}_t(\Theta)$ directly (which, of course, presupposes that the parameter space has global Euclidean structure). Then, slightly more trials would be spent to decrease the weights of the secondary modes, but they should remain insignificant in proportion. Thus, we can conjecture that $B_t \to B^*$ would still obtain in Theorem 4.1 with $t\operatorname{Cov}_t(\Theta)$ asymptotically equal to $(B_t)^{-1}$, making the result globally optimal.

4.3. Asymptotic entropy 

Here we use the D-optimality result to derive an expression for the asymptotic entropy. 

Corollary 4.4. Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O4, for any neighborhood $U$ of $\theta_0$, there exists a constant $c_U$ such that almost surely, $p_t(U^c) \le c_U/t$ for a.e. $t \in \mathbb{N}$.

Proof. Theorem 4.1 implies that $\min\lambda_{B_t} > \mu$ for all sufficiently large $t$ for some $\mu > 0$. Hence, given any $\epsilon > 0$, Theorem 3.1(3) yields

$$t\, I_t(\Theta; Y_{X_{t+1}} \mid U) \le \sup_{x \in X} B_t^{-1} \odot I_x(\theta_0) + \epsilon \le n\mu^{-1}M + \epsilon =: c$$

for all sufficiently large $t$, where $U$ is any sufficiently small neighborhood of $\theta_0$. Combined with Lemma 4.8, this implies that $I_t(\Theta; Y_{X_{t+1}}) \le 2c/t$ for a.e. $t \in \mathbb{N}$, and so Lemma 2.5(2) yields the statement. □


Remark 4.2. Note that the statement of Corollary 4.4 holds only for a.e. $t \in \mathbb{N}$. What happens in a sufficiently long run is that most trials are spent on increasing the accuracy around the global mode and an approximately logarithmically growing number of trials is spent on placements that decrease the weights of secondary modes. However, on any such trial there is a small probability that the weight of the secondary mode actually increases, and given a sufficiently long run, this will eventually happen arbitrarily many times in a row, making the weight of the secondary mode temporarily arbitrarily much larger than the $c/t$ bound that holds on most trials.


Theorem 4.2. Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O4, if the prior entropy $H(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structure (i.e., the prior density $p(\theta)$ is given w.r.t. a measure that coincides with the Lebesgue measure on subsets of $U_0$) is well-defined and finite, then, almost surely

$$H_t(\Theta) + \frac{n}{2}\log t \rightsquigarrow H^* := -\frac{1}{2}\log\det(B^*) + \frac{n}{2}\log(2\pi e).$$

Proof. Let us condition everything on $\theta_0$ being the true value. Theorem 3.1(2) implies that for some sufficiently small neighborhood $U$ of $\theta_0$,

$$H_t(\Theta \mid U) + \frac{n}{2}\log t \to H^*.$$

Lemmas 2.6 and 2.8 imply that for any $\epsilon > 0$, $|H_t(\Theta \mid U^c)| \le \epsilon t$ for all sufficiently large $t$, and as Corollary 4.4 yields $p_t(U^c) \le c/t$ for a.e. $t$, Lemma 4.2(2) implies $p_t(U^c)\, H_t(\Theta \mid U^c) \rightsquigarrow 0$. The statement now follows from the chain rule of entropy

$$H_t(\Theta) = p_t(U)\, H_t(\Theta \mid U) + p_t(U^c)\, H_t(\Theta \mid U^c) + \underbrace{H_t([\Theta \in U])}_{\to 0 \text{ a.s.}},$$

where the first term satisfies

$$p_t(U)\, H_t(\Theta \mid U) + \frac{n}{2}\log t = p_t(U)\left[H_t(\Theta \mid U) + \frac{n}{2}\log t\right] + \underbrace{p_t(U^c)}_{\le c/t}\,\frac{n}{2}\log t \rightsquigarrow H^*.$$

□
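The form of the limit $H^*$ can be motivated by a short worked computation. As a sketch, assume the posterior restricted to $U$ is approximated by a normal density with covariance $(tB^*)^{-1}$, which is the local approximation suggested by Corollary 4.3; this is a heuristic check, not a substitute for the proof above.

```latex
\begin{aligned}
H\bigl(N(\theta_0, (tB^*)^{-1})\bigr)
  &= \frac{n}{2}\log(2\pi e) + \frac{1}{2}\log\det\bigl((tB^*)^{-1}\bigr) \\
  &= \frac{n}{2}\log(2\pi e) - \frac{n}{2}\log t - \frac{1}{2}\log\det(B^*) \\
  &= H^* - \frac{n}{2}\log t .
\end{aligned}
```

Adding $\frac{n}{2}\log t$ to both sides recovers $H^*$, matching the statement of Theorem 4.2.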


Corollary 4.5. Suppose that O1-O4 hold for almost all $\theta_0 \in \Theta$ and that the prior entropy $H(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structures $U_0$ in O2 is well-defined and finite. Then,

$$H_t(\Theta) + \frac{n}{2}\log t \overset{P}{\rightsquigarrow} H^*.$$

In other words, there exists a set $K \subset \mathbb{N}$ of indices with $p(K) = 1$ such that

$$H_t(\Theta) + \frac{n}{2}\log t \overset{P}{\to} H^*,$$

as $t$ increases within $K$.

Proof. Apply Lemma 4.6 to the statement of Theorem 4.2. □

4.4. Varying cost of observation 

In Kujala [5] the adaptive sequential estimation framework is generalized to the situation where the observation of $Y_x$ is associated with some random cost $C_x$ of observation, which given the value of $Y_x$, is independent of $\Theta$ and the results and costs of any other observations:

[Diagram: $\Theta \to Y_x \to C_x$ and $\Theta \to Y_{x'} \to C_{x'}$]

The technical requirement that $C_x$ depends on $\Theta$ only through $Y_x$ is satisfied in particular if $C_x$ is a component of $Y_x$. Thus, it leads to no loss of generality if the incurred costs are observable.

The goal considered in Kujala [5] is maximization of the expected information gain of a sequential experiment that terminates when the total cost overruns a given budget. To achieve this goal, the heuristic of maximizing the expected information gain $I_t(\Theta; Y_x)$ divided by the expected cost $\mathrm{E}_t(C_x)$ on each trial is proposed. In this section, we are able to show that this heuristic is in fact asymptotically optimal (as the budget tends to infinity) under essentially the same conditions that the plain information gain maximization is.

Thus, condition O4 is now replaced by the following:

O4'. The placements satisfy

$$R_t := \frac{I_t(\Theta; Y_{X_{t+1}})/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x \in X}\bigl(I_t(\Theta; Y_x)/\mathrm{E}_t(C_x)\bigr)} \rightsquigarrow 1,$$

where $|C_x| \le M$, $\mathrm{E}(C_x \mid \theta_0) \ge \gamma' > 0$, and the family of expected cost functions $\{\theta \mapsto \mathrm{E}(C_x \mid \theta) : x \in X\}$ is equicontinuous at $\theta_0$.

Due to the assumed bounds on the expected cost $\mathrm{E}(C_x \mid \theta_0)$, condition C4 is still satisfied and so all the previous lemmas depending on it apply. Together with the following lemma, these bounds also imply that the total cost grows asymptotically within linear bounds.
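The cost-aware rule differs from plain information maximization only in the selection criterion. A minimal sketch, assuming the expected gains and expected costs of the candidate placements have already been computed (the numbers and the function name `cost_aware_placement` below are purely illustrative):

```python
def cost_aware_placement(gains, expected_costs):
    """Index of the placement maximizing expected information gain
    divided by expected cost."""
    return max(range(len(gains)), key=lambda i: gains[i] / expected_costs[i])

# hypothetical expected gains (nats) and expected costs of three placements
gains = [0.30, 0.20, 0.05]
costs = [3.0, 1.0, 0.5]

by_ratio = cost_aware_placement(gains, costs)                # gain per unit cost
by_gain = max(range(len(gains)), key=lambda i: gains[i])     # plain gain
```

Here the most informative placement is also the most expensive one, so the ratio criterion selects a different placement than the plain gain criterion.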
Lemma 4.10. Suppose that O4' holds. Then, conditioned on $\theta_0$ as the true parameter value,

$$\frac{C_t - \sum_{k=1}^{t} \mathrm{E}(C_{X_k} \mid \theta_0)}{t} \xrightarrow{\text{a.s.}} 0,$$

where $C_t := \sum_{k=1}^{t} C_{X_k}$. In particular, for any $\gamma < \gamma'$, almost surely $C_t \ge t\gamma$ for all sufficiently large $t$ (as well as $C_t \le tM$ for all $t$).

Proof. Denoting $Z_k = C_{X_k} - \mathrm{E}(C_{X_k} \mid \theta_0)$, given $\Theta = \theta_0$, the sequence $Z_1 + \cdots + Z_k$ of partial sums is a martingale and satisfies $\mathrm{E}(|Z_k|^2) \le M^2 < \infty$ for all $k$, and so Theorem A.3 implies that $(Z_1 + \cdots + Z_t)/t \xrightarrow{\text{a.s.}} 0$, which is the statement. □


Next, we will generalize Corollary 4.2 for the cost-aware placements. 


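Corollary 4.6. Conditioned on almost any $\theta_0$ satisfying O1-O3 and O4', the sequence

$$D_t := \sup_{x \in X} B_t^{-1} \odot \frac{I_x(\theta_0)}{\mathrm{E}(C_x \mid \theta_0)} - B_t^{-1} \odot \frac{I_{X_{t+1}}(\theta_0)}{\mathrm{E}(C_{X_{t+1}} \mid \theta_0)}$$

satisfies $[\min\lambda_{(C_t/t)B_t} > \mu]\, D_t \rightsquigarrow 0$ a.s. for any given $\mu > 0$, where $\min\lambda_{(C_t/t)B_t}$ denotes the smallest eigenvalue of $(C_t/t)B_t$ with $B_t := -C_t^{-1}\nabla_\theta^2 \log p(Y_t \mid \theta_0)$ and $C_t := \sum_{k=1}^{t} C_{X_k}$.

Proof. Let us first shrink the neighborhood $U_0$ of $\theta_0$ as necessary to make its diameter smaller than the constant $\delta_{\mu,\epsilon}$ given by Lemma 3.2. Then, let $U \subset U_0$ be the neighborhood of $\theta_0$ given by Lemma 4.8. The boundedness and equicontinuity at $\theta_0$ of $\theta \mapsto \mathrm{E}(C_x \mid \theta) \in [\gamma', M]$ imply that conditioned on $\Theta = \theta_0$, almost surely, $\mathrm{E}_t(C_x) \to \mathrm{E}(C_x \mid \theta_0)$, uniformly over all $x \in X$. Combined with Theorem 3.1(3), this implies that there exist random sequences $E_t \to 0$ and $E'_t \to 0$ such that conditioned on $\theta_0$ as the true value,

$$\tfrac{1}{2}\sup_{x \in X} B_t^{-1} \odot \frac{I_x(\theta_0)}{\mathrm{E}(C_x \mid \theta_0)} = \sup_{x \in X} C_t\,\frac{I_t(\Theta; Y_x \mid U)}{\mathrm{E}_t(C_x)} + E_t,$$
$$\tfrac{1}{2} B_t^{-1} \odot \frac{I_{X_{t+1}}(\theta_0)}{\mathrm{E}(C_{X_{t+1}} \mid \theta_0)} = C_t\,\frac{I_t(\Theta; Y_{X_{t+1}} \mid U)}{\mathrm{E}_t(C_{X_{t+1}})} + E'_t$$

whenever $\min\lambda_{(C_t/t)B_t} > \mu$. For these $t$, it follows

$$\tfrac{1}{2} D_t = \Bigl(\tfrac{1}{2}\sup_{x \in X} \underbrace{B_t^{-1} \odot \frac{I_x(\theta_0)}{\mathrm{E}(C_x \mid \theta_0)}}_{\le \operatorname{tr}(B_t^{-1} I_x(\theta_0))/\gamma \le n(\gamma\mu)^{-1}M} - E_t\Bigr)\left(1 - \frac{I_t(\Theta; Y_{X_{t+1}} \mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x \in X}\bigl(I_t(\Theta; Y_x \mid U)/\mathrm{E}_t(C_x)\bigr)}\right) + E_t - E'_t,$$

where Lemma 4.8 and the inequality $I_t(\Theta; Y_x) \ge p_t(U)\, I_t(\Theta; Y_x \mid U)$ yield

$$\frac{I_t(\Theta; Y_{X_{t+1}} \mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x \in X}\bigl(I_t(\Theta; Y_x \mid U)/\mathrm{E}_t(C_x)\bigr)} \ge p_t(U)\,\frac{I_t(\Theta; Y_{X_{t+1}} \mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x \in X}\bigl(I_t(\Theta; Y_x)/\mathrm{E}_t(C_x)\bigr)} = p_t(U)\, Q_t R_t \rightsquigarrow 1,$$

and so $[\min\lambda_{(C_t/t)B_t} > \mu]\, D_t \rightsquigarrow 0$. □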


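Lemma 4.11. The range of the expression

$$r_t := \frac{\sum_{k=1}^{t} I_{x_k}(\theta_0)}{\sum_{k=1}^{t} \mathrm{E}(C_{x_k} \mid \theta_0)}$$

over all sequences $x_k$ in $X$ and all finite $t$ is a dense subset of the set $\mathcal{I}$ defined as the closure of the convex hull of

$$S := \left\{\frac{I_x(\theta_0)}{\mathrm{E}(C_x \mid \theta_0)}\right\}_{x \in X}.$$

Furthermore, the range of the limits of all converging $r_t$ equals $\mathcal{I}$.

Proof. For any sequence $\{x_k\}$, we have

$$r_t = \frac{\sum_{k=1}^{t} I_{x_k}(\theta_0)}{\sum_{k=1}^{t} \mathrm{E}(C_{x_k} \mid \theta_0)} = \sum_{k=1}^{t} \underbrace{\left(\frac{\mathrm{E}(C_{x_k} \mid \theta_0)}{\sum_{j=1}^{t} \mathrm{E}(C_{x_j} \mid \theta_0)}\right)}_{=:\,\alpha_{k,t}} \frac{I_{x_k}(\theta_0)}{\mathrm{E}(C_{x_k} \mid \theta_0)},$$

and so $r_t$ is always a convex combination of elements in $S$. The convex combination is not exactly linear w.r.t. the number of different $x$ in the sequence because of the different $\mathrm{E}(C_{x_k} \mid \theta_0)$ weights, but nonetheless, by varying the proportions of different $x$ in a sufficiently long sequence, any convex combination can be approximated arbitrarily well. □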


Theorem 4.3 (Asymptotic D-optimality, part 1). Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O3, O4', almost surely,

$$B_t := -\frac{1}{C_t}\nabla_\theta^2 \log p(Y_t \mid \theta_0) \to B^* := \arg\max_{B \in \mathcal{I}} \det(B),$$

where $C_t := \sum_{k=1}^{t} C_{X_k}$ and $\mathcal{I}$ is the convex hull of the closure of

$$S = \left\{\frac{I_x(\theta_0)}{\mathrm{E}(C_x \mid \theta_0)} : x \in X\right\}.$$

This is optimal in the sense that for any strategy of choosing the placements $X_t$ (instead of O4' and C4), almost surely $\limsup_{t \to \infty} \det(B_t) \le \det(B^*)$.














Proof. Since $S$ is bounded, $\mathcal{I}$ is a compact convex set and $B^*$ is well defined. Lemmas 3.5, 4.10, and 4.11 imply that $\limsup_{t \to \infty} \det(B_t) \le \det(B^*)$ a.s. Let us then show that this upper bound is tight.

Lemma 4.11 implies that there exists a representation

$$B^* = \lim_{m \to \infty} \frac{\sum_{k=1}^{m} I_k}{\sum_{k=1}^{m} C_k}$$

of the optimum point $B^*$ where $(I_k, C_k)$ are elements of $\{(I_x(\theta_0), \mathrm{E}(C_x \mid \theta_0)) : x \in X\}$.

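Denoting $B := -\nabla_\theta^2 \log p(Y_{X_{t+1}} \mid \theta_0)$ and $C := C_{X_{t+1}}$, and assuming $\min \lambda\,(C_t/t)B_t > \mu$,
we obtain

$$|B|, |C| \le M, \quad |B_t^{-1}| \le (\mu/M)^{-1}, \quad |B - C B_t| \le M + M^2/\mu, \quad C_t + C \ge \gamma (t+1),$$

and so, for some $B'$ between $0$ and $B_{t+1} - B_t$, we obtain

$$\begin{aligned}
f(B_{t+1}) - f(B_t) &= f\!\left( \frac{C_t B_t + B}{C_t + C} \right) - f(B_t) \\
&= B_t^{-1} \odot \frac{B - C B_t}{C_t + C} - \frac{1}{2} \operatorname{tr}\!\left[ \left( (B_t + B')^{-1} \frac{B - C B_t}{C_t + C} \right)^{2} \right] \\
&\ge \frac{B_t^{-1} \odot B - nC}{C_t + C} - \frac{[(\mu/M)^{-1}(M + M^2/\mu)]^2}{(C_t + C)^2} \\
&\ge \underbrace{\frac{E_t(C \mid \theta_0)}{C_t + C}}_{\ge (\gamma/M)/(t+1)} \left( B_t^{-1} \odot \frac{B}{E_t(C \mid \theta_0)} - \frac{nC}{E_t(C \mid \theta_0)} - \frac{D}{t+1} \right),
\end{aligned}$$

where $D$ is a constant.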


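Denoting by $\lambda_i$ the eigenvalues of $B_t^{-1} B^*$, we obtain

$$\begin{aligned}
B_t^{-1} \odot \frac{I_{X_{t+1}}(\theta_0)}{E(C_{X_{t+1}} \mid \theta_0)} + D_t
&= \sup_{x \in X} B_t^{-1} \odot \frac{I_x(\theta_0)}{E(C_x \mid \theta_0)} \\
&\ge \sup_{k} \left( B_t^{-1} \odot \frac{I_k}{C_k} \right) \ge \lim_{m \to \infty} \frac{\sum_{k=1}^{m} B_t^{-1} \odot I_k}{\sum_{k=1}^{m} C_k} = B_t^{-1} \odot B^* \\
&= \operatorname{tr}(B_t^{-1} B^*) = \sum_{i=1}^{n} \lambda_i = n + \sum_{i=1}^{n} (\lambda_i - 1) \ge n + \sum_{i=1}^{n} \log(\lambda_i) \\
&= n + \log\det(B_t^{-1} B^*) = n + f(B^*) - f(B_t),
\end{aligned}$$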


where Corollary 4.6 implies that $[\min \lambda\,(C_t/t)B_t > \mu] \cdot D_t \to 0$. Noting that
$E_t(B / E_t(C \mid \theta_0) \mid \theta_0) = I_{X_{t+1}}(\theta_0) / E(C_{X_{t+1}} \mid \theta_0)$, it follows

$$E_t(f(B_{t+1}) \mid \theta_0) - f(B_t) \ge \frac{\gamma/M}{t+1} \left( f(B^*) - f(B_t) - D'_t \right),$$

where $D'_t = D_t + D/(t+1)$.

From here on, the proof is essentially the same as in the maximum information case.
We just use $\mu := \exp(f_0) M^{-n}/2$ to guarantee that $\min \lambda\,(C_t/t)B_t > \mu$ for $f(B_t) > f_0$. □

Part 2 of the D-optimality result as well as analogs of the asymptotic entropy
results follow with essentially the same proofs (just replacing $t$ with $C_t$ at appropriate
places):

Corollary 4.7 (Asymptotic D-optimality, part 2). Conditioned on almost any
$\theta_0 \in \Theta$ satisfying O1-O3, O4', there exists a neighborhood $U$ of $\theta_0$ such that
$C_t \operatorname{Cov}_t(\Theta \mid U) \xrightarrow{\text{a.s.}} (B^*)^{-1}$, where $C_t := \sum_{k=1}^{t} C_{X_k}$. This is optimal in the sense that for any
other strategy in place of O4' and C4, almost surely $\liminf_{t \to \infty} \det(C_t \operatorname{Cov}_t(\Theta \mid U)) \ge
\det(B^*)^{-1}$.

Theorem 4.4. Conditioned on almost any $\theta_0 \in \Theta$ satisfying O1-O3, O4', if the prior
entropy $H(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structure
(i.e., the prior density $p(\theta)$ is given w.r.t. a measure that coincides with the Lebesgue
measure on subsets of $U_{\theta_0}$) is well-defined and finite, then, almost surely

$$H_t(\Theta) + \frac{n}{2} \log C_t \to H^* := -\frac{1}{2} \log\det(B^*) + \frac{n}{2} \log(2\pi e),$$

where $C_t := \sum_{k=1}^{t} C_{X_k}$.

Corollary 4.8. Suppose that O1-O4 hold for almost all $\theta_0 \in \Theta$ and that the prior entropy
$H(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structures $U_{\theta_0}$
in O2 is well-defined and finite. Then,

$$H_t(\Theta) + \frac{n}{2} \log C_t \to H^*$$

on most trials, where $C_t := \sum_{k=1}^{t} C_{X_k}$. In other words, there exists a set $K \subset \mathbb{N}$ of indices with $\rho(K) = 1$
such that

$$H_t(\Theta) + \frac{n}{2} \log C_t \to H^*$$

as $t$ increases within $K$.


5. Examples 

In this section, we give specific examples illustrating the optimality results. 

Example 5.1 (Psychometric model). Consider the psychometric model, where an
observer's unknown intensity threshold $\Theta$ for detecting a stimulus of intensity $x$ is
distributed uniformly on $[0, 100]$ and the trial result $Y_x \in \{0, 1\}$ for a test intensity $x \in [0, 100]$
is distributed as

$$p(y_x \mid \theta) = \begin{cases} \psi(\theta - x), & y_x = 1 \text{ (detected)}, \\ 1 - \psi(\theta - x), & y_x = 0 \text{ (not detected)}, \end{cases}$$

where $\psi(x)$ is the psychometric function, here assumed to be the sigmoid

$$\psi(x) = \frac{1}{1 + e^{-x}}$$

for simplicity (for more general psychometric models, see Kujala and Lukka [7], and the
references therein).

In this model, the Fisher information of a given placement $x$ is calculated as

$$I_x(\theta) = \sum_{y_x = 0}^{1} p(y_x \mid \theta) \left[ \frac{\partial}{\partial\theta} \log p(y_x \mid \theta) \right]^2 = \frac{\psi'(\theta - x)^2}{\psi(\theta - x)[1 - \psi(\theta - x)]} = \frac{e^{\theta - x}}{[1 + e^{\theta - x}]^2}.$$

Thus, for any given $\theta_0$, the D-optimal value of the averaged Fisher information in
Theorem 4.1 is $B^* = \frac{1}{4}$ given by the placement $x = \theta_0$ to which the greedy algorithm eventually
converges. Now Corollary 4.5 yields

$$H_t(\Theta) + \frac{n}{2} \log t \to H^* = -\frac{1}{2} \log\det(\underbrace{B^*}_{=\,0.25}) + \frac{n}{2} \log(2\pi e) \tag{5.1}$$

and this is the asymptotically optimal posterior entropy. In this example, the same
expression also gives the asymptotically optimal expected utility $E(H_t(\Theta)) + \frac{n}{2} \log t$, which
we will next compare to that of the offline design.
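The quantities of Example 5.1 can be checked numerically with a minimal sketch. It assumes the logistic psychometric function and $n = 1$; the true threshold `theta0 = 40.0`, the search grid, and all function names are illustrative choices, not part of the original example. The Fisher information is computed once from the squared-score definition by central differences and once from the closed form.

```python
import math

def psi(u):
    """Logistic psychometric function assumed in Example 5.1."""
    return 1.0 / (1.0 + math.exp(-u))

def fisher_information(theta, x, h=1e-5):
    """Expected squared score of the binary outcome, via central differences."""
    total = 0.0
    for y in (0, 1):
        def log_p(th):
            p1 = psi(th - x)
            return math.log(p1 if y == 1 else 1.0 - p1)
        p_y = math.exp(log_p(theta))
        score = (log_p(theta + h) - log_p(theta - h)) / (2.0 * h)
        total += p_y * score ** 2
    return total

def closed_form(theta, x):
    """Closed-form Fisher information e^(theta-x) / (1 + e^(theta-x))^2."""
    e = math.exp(theta - x)
    return e / (1.0 + e) ** 2

theta0 = 40.0  # hypothetical true threshold
grid = [theta0 + 0.01 * k for k in range(-500, 501)]
best_x = max(grid, key=lambda x: closed_form(theta0, x))
b_star = closed_form(theta0, best_x)

n = 1  # one-dimensional parameter
h_star = -0.5 * math.log(b_star) + 0.5 * n * math.log(2.0 * math.pi * math.e)

print(best_x, b_star)  # the information peaks at x = theta0 with value 0.25
print(h_star)          # right-hand side of (5.1)
```

Under these assumptions the maximal information is 0.25 at the placement equal to the threshold, and the limit in (5.1) evaluates to roughly 2.11 nats.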


Example 5.2 (Offline design). A rigorous study of the optimal offline design is beyond 
the scope of the present article, so we will not go into detailed proofs here but only sketch 
the general ideas. Suffice it to say that for an offline design for optimizing the expected 
utility E(H t (0)), one cannot do much better than to use the usual strategy of placing 
the trials evenly on the interval [0,100]. (Due to boundary effects, an exactly uniform 
distribution of placements is not really the global optimum, but for simplicity, we avoid 
a more complicated discussion here.) 

For uniform placement of trials on $[0, 100]$, Lemma 3.5 implies

$$B_t \to \frac{1}{100} \int_0^{100} I_x(\theta_0)\,dx = \frac{1}{100} \left( \frac{1}{1 + e^{-\theta_0}} - \frac{1}{1 + e^{100 - \theta_0}} \right) \in [0.005, 0.01],$$

where $B_t = -t^{-1} \nabla_\theta^2 \log p(Y_t \mid \theta_0)$, and it can be shown that the asymptotic posterior
entropy satisfies

$$H_t(\Theta) + \frac{n}{2} \log t - \Bigl[ -\frac{1}{2} \log\det(\underbrace{B_t}_{\lim \le 0.01}) + \frac{n}{2} \log(2\pi e) \Bigr] \to 0,$$

which implies the asymptotic lower bound

$$\liminf_{t \to \infty} \Bigl[ H_t(\Theta) + \frac{n}{2} \log t \Bigr] \ge -\frac{1}{2} \log 0.01 + \frac{n}{2} \log(2\pi e)$$

on the posterior entropy. Comparing to the asymptotically optimal posterior entropy
(5.1), it follows that the offline design needs asymptotically at least $\left( \frac{0.25}{0.01} \right)^{1/n} = 25$ times
as many trials as the optimal adaptive design for the same accuracy. If the range $[0, 100]$ is
doubled, then this number approximately doubles as well, so the gap to the asymptotically
optimal adaptive design can be arbitrarily large.
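The comparison in Example 5.2 can be reproduced with a short sketch, again assuming the logistic model of Example 5.1 with $n = 1$. The helper names and the sample threshold values are hypothetical; the closed-form average is cross-checked against a midpoint-rule integral.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def info(theta0, x):
    """Fisher information of placement x under the logistic model."""
    e = math.exp(theta0 - x)
    return e / (1.0 + e) ** 2

def uniform_average_info(theta0, lo=0.0, hi=100.0):
    """Closed-form average of I_x(theta0) over x uniform on [lo, hi]."""
    return (sigmoid(theta0 - lo) - sigmoid(theta0 - hi)) / (hi - lo)

def midpoint_average_info(theta0, lo=0.0, hi=100.0, steps=20000):
    """Midpoint-rule approximation of the same average."""
    width = (hi - lo) / steps
    total = sum(info(theta0, lo + (k + 0.5) * width) for k in range(steps))
    return total * width / (hi - lo)

def trial_ratio(theta0, lo=0.0, hi=100.0, n=1):
    """Asymptotic factor of extra trials needed by the uniform offline design."""
    return (0.25 / uniform_average_info(theta0, lo, hi)) ** (1.0 / n)

print(trial_ratio(50.0))               # interior threshold: about 25
print(trial_ratio(0.0))                # boundary threshold: about 50
print(trial_ratio(100.0, 0.0, 200.0))  # doubled range: about 50
```

The factor is smallest for thresholds well inside the range and grows in proportion to the length of the range, in line with the remark that the gap can be arbitrarily large.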


Example 5.3 (Varying cost of observation). Let us then return to the adaptive
case and suppose that instead of a unit cost, each trial costs

$$C_x = 1 + 3\,[Y_x = 0]$$

units. Such a formulation could be based on the assumption that the observer takes
four times as long to respond when the stimulus is not detected. Then, the asymptotic
efficiency of a placement $x$ in Theorem 4.3 is characterized by the expression

$$\frac{I_x(\theta_0)}{E(C_x)} = \frac{I_x(\theta_0)}{1 + 3[1 - \psi(\theta_0 - x)]} = \frac{1}{5 + 5\cosh(\theta_0 - x) - 3\sinh(\theta_0 - x)}. \tag{5.2}$$

This expression is maximized by the placement $x = \theta_0 + \log 2$ to which the myopic algorithm
eventually converges (provided it is within the range $[0, 100]$). Thus, assuming
that $\theta_0 < 100 - \log 2 \approx 99.3069$ and substituting the maximizer in (5.2), we obtain in
Theorem 4.3 the D-optimal asymptotic efficiency $B^* = \frac{1}{9}$. Comparing to the asymptotically
optimal placement $x = \theta_0$ for unit cost (yielding the value $\frac{1}{10}$ in (5.2)), we see that the
cost-aware strategy reaches the same accuracy in 10% less cost (time) in this example.
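The efficiencies quoted in Example 5.3 can be verified with a small sketch that evaluates the right-hand side of (5.2) as a function of the argument `d` of the hyperbolic functions. The function names and the search grid are illustrative assumptions; the second function recomputes the same ratio directly from the logistic model and the cost $1 + 3[Y_x = 0]$.

```python
import math

def efficiency(d):
    """Right-hand side of (5.2), with d the argument of cosh and sinh."""
    return 1.0 / (5.0 + 5.0 * math.cosh(d) - 3.0 * math.sinh(d))

def efficiency_from_model(d):
    """Same ratio built from the Fisher information and the expected cost."""
    p_detect = 1.0 / (1.0 + math.exp(-d))
    information = p_detect * (1.0 - p_detect)
    expected_cost = 1.0 + 3.0 * (1.0 - p_detect)
    return information / expected_cost

offsets = [0.001 * k for k in range(-5000, 5001)]
best = max(efficiency(d) for d in offsets)  # optimum over the grid
unit_cost_choice = efficiency(0.0)          # placement at the threshold
cost_saving = 1.0 - unit_cost_choice / best

print(best, unit_cost_choice)  # about 1/9 and exactly 1/10
print(cost_saving)             # about 0.10
```

Since the total cost needed for a given accuracy scales inversely with the efficiency, the ratio of the two values gives the 10% saving stated in the example.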


6. Discussion 

We have derived an expression for the asymptotic efficiency of any sequential experiment 
design for both the standard framework with unit cost of observation as well as for the 
generalized framework with random costs of observation as proposed in Kujala [5]. We 
have shown an asymptotic D-optimality result for the greedy information optimization 
strategy in the standard framework and we have extended this result for the novel myopic 
strategy proposed in Kujala [5] for the situation with random costs of observations. These 
results indicate that for (almost) all true parameter values $\theta_0$, the greedy or myopic
adaptive design is asymptotically optimal among all placement strategies in a well-defined 
sense. 

Assuming the standard sequential estimation framework with unit cost of observation,
Lemma 3.5 together with the asymptotic normality result implies that the asymptotic
efficiency of any given design is characterized by the average

$$\frac{\sum_{k=1}^{t} I_{X_k}(\theta_0)}{t}$$

of the Fisher information matrices $I_x(\theta_0)$ over the sequence of placements $X_t$, and the
D-optimality criterion of a design refers to maximality of the determinant of this averaged
information matrix at the limit. For any given $\theta_0$, there is a distribution (or sequence)
of placements $x \in X$ yielding the D-optimal average information matrix. For (almost) all
$\theta_0$, the placements of the greedy adaptive design converge to such an optimum, whereas
the offline design cannot adjust the distribution of the placements $x \in X$ depending on
the true value $\theta_0$. Thus, the offline design can be equally efficient for a given true value
of $\Theta$, but generally not for all values $\theta_0 \in \Theta$ and, depending on the model, the gap in
efficiency can be arbitrarily large as seen in Example 5.2.

The situation is essentially the same in the framework with random costs of observation,
the only difference being that the convergence of the estimate of $\Theta$ is not measured
in relation to $t$ but in relation to the total cost $C_t = C_{X_1} + \cdots + C_{X_t}$ of placements. In
this situation, the asymptotic efficiency is characterized by the ratio

$$\frac{\sum_{k=1}^{t} I_{X_k}(\theta_0)}{\sum_{k=1}^{t} E(C_{X_k} \mid \theta_0)}$$

and the limit is again determined by the distribution (or sequence) of the placements
$x \in X$. Theorem 4.3 shows that the myopic strategy of maximizing

$$\frac{I_t(\Theta; Y_x)}{E_t(C_x)}$$

yields the asymptotically D-optimal efficiency in this situation.
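One step of the myopic rule can be sketched as follows for the psychometric model of Example 5.1 with the cost model of Example 5.3. This is an illustrative approximation rather than the procedure analyzed in the paper: the posterior is represented on a finite grid of threshold values, the candidate placements are restricted to the same grid, and all function names are hypothetical.

```python
import math

def psi(u):
    return 1.0 / (1.0 + math.exp(-u))

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def myopic_placement(thetas, posterior, placements):
    """Pick x maximizing expected information gain per expected cost."""
    best_x, best_score = None, -1.0
    for x in placements:
        p_detect = [psi(th - x) for th in thetas]
        marginal = sum(w * p for w, p in zip(posterior, p_detect))
        # Mutual information I_t(Theta; Y_x) = H(Y_x) - E[H(Y_x | Theta)].
        gain = binary_entropy(marginal) - sum(
            w * binary_entropy(p) for w, p in zip(posterior, p_detect))
        expected_cost = 1.0 + 3.0 * (1.0 - marginal)  # cost of Example 5.3
        score = gain / expected_cost
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

def update(thetas, posterior, x, y):
    """Bayes update of the grid posterior after observing y at placement x."""
    like = [psi(th - x) if y == 1 else 1.0 - psi(th - x) for th in thetas]
    unnorm = [w * l for w, l in zip(posterior, like)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

thetas = [float(k) for k in range(101)]    # grid over [0, 100]
prior = [1.0 / len(thetas)] * len(thetas)  # uniform prior on the grid
x1, score1 = myopic_placement(thetas, prior, thetas)
posterior = update(thetas, prior, x1, 1)   # suppose the stimulus was detected
print(x1, score1)
```

Repeating the two calls with simulated outcomes gives a full adaptive run; the grid resolution limits how closely such a run can follow the asymptotic behavior described above.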

However, the actual utility function assumed in both of the frameworks considered is 
the differential entropy, and so the most relevant asymptotic optimality criterion should 
be based on the asymptotic properties of the differential entropy as shown in, for example, 
Corollaries 4.5 and 4.8. Thus, a topic for future work is finding conditions under which 
the results of Corollaries 4.5 and 4.8 can be said to be optimal among all placement 
strategies. 


Appendix: Auxiliary theorems 

Theorem A.1 (Stone-Čech compactification). Suppose that $X$ is a Tychonoff
space. Then there exists a compact space $\beta X$ that embeds $X$ as a dense subspace. Any
continuous map $f : X \to K$, where $K$ is a compact Hausdorff space, lifts uniquely to a
continuous map $\beta X \to K$.

Theorem A.2 (Martingale convergence). Let $X_k$ be a submartingale (i.e., $E(X_{k+1} \mid
X_1, \ldots, X_k) \ge X_k$) and suppose that $\sup_k E|X_k| < \infty$. Then, $X = \lim_{k \to \infty} X_k$ exists almost
surely and $E|X| < \infty$.







Proof. For example, [11], Theorem B.117, page 648, or [12], Theorem 1, page 508. □ 


Theorem A.3 (A strong law of large numbers for martingales). Let $X_k = Z_1 +
\cdots + Z_k$ be a martingale and let $\delta > 0$. If

$$\sum_{k=1}^{\infty} \frac{E|Z_k|^{2\delta}}{k^{1+\delta}} < \infty,$$

then $X_k/k \xrightarrow{\text{a.s.}} 0$.


Proof. For example, [2] or [12], Theorem 4, page 519. □ 

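Theorem A.4 (Hoeffding-Azuma inequality). Let $X_k$ be a martingale and suppose
that $|X_k - X_{k-1}| \le c_k$ for all $k$. Then, for all $t > 0$ and $n \in \mathbb{N}$,

$$P\{X_n - X_0 \ge t\} \le \exp\left( -\frac{t^2}{2 \sum_{k=1}^{n} c_k^2} \right)$$

and

$$P\{|X_n - X_0| \ge t\} \le 2 \exp\left( -\frac{t^2}{2 \sum_{k=1}^{n} c_k^2} \right).$$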


Proof. See [4], Theorem 2 and note around (2.18) on page 18, or [1], 


□ 
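As a sanity check of the one-sided bound in Theorem A.4, the following sketch compares it with the exact tail probability of a simple random walk with independent fair ±1 steps, which is a martingale with $c_k = 1$. The walk, the sample values of $n$ and $t$, and the function names are illustrative choices.

```python
import math

def walk_tail(n, t):
    """Exact P{X_n - X_0 >= t} for a walk with independent fair +/-1 steps."""
    # X_n - X_0 = 2*H - n, where H ~ Binomial(n, 1/2) counts the +1 steps.
    min_heads = math.ceil((n + t) / 2)
    return sum(math.comb(n, h) for h in range(min_heads, n + 1)) / 2 ** n

def azuma_bound(n, t, c=1.0):
    """One-sided bound exp(-t^2 / (2 * sum of c_k^2)) with c_k = c for all k."""
    return math.exp(-t ** 2 / (2.0 * n * c ** 2))

n = 100
for t in (5, 10, 20, 30):
    print(t, walk_tail(n, t), azuma_bound(n, t))  # exact tail, then bound
```

In each printed row the exact tail probability lies below the bound, as the inequality requires.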
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